ABSTRACT 

This document answers the question, "What has been learned about
evaluation methodology from the decade of compensatory education?" Ten
issues dealing with the evaluation of Title I were identified within a
general theoretical framework of evaluation. For each issue it was the
aim of this document to do the following: 1) to clarify the issue, 2) to
point out examples in which it is crucial, 3) to present and evaluate
arguments on different sides of the issue, and 4) to suggest resolutions
of the issue. Each of the issues was selected because its resolution is a
necessary step in the development of a rational Title I evaluation
policy. The issues addressed are: 1) To what should Title I treatments be
compared? 2) Is longitudinal evaluation necessary? 3) When is
representative sampling important? 4) How large a sample is necessary?
5) What constructs should be measured to determine Title I impact?
6) What types of achievement measurement instruments should be used in
Title I evaluation? 7) What units of measurement should be used? 8) What
are the conditions for valid comparisons between nonequivalent treatment
and comparison groups? 9) Under what conditions can relationships of
Title I costs and treatments to effectiveness be inferred? 10) How should
data be aggregated across projects in Title I evaluations?
(Author/Atf)

Preface

In accordance with the objectives of contract 400-76-129 between the
National Institute of Education and the American Institutes for Research,
this document was produced to present as clearly as possible to
non-methodologists the sources of controversy surrounding the decade of
evaluation of Title I of the Elementary and Secondary Education Act of
1965. At the beginning of October 1976, the authors set out to write
(1) this methodological discussion, (2) a volume of summaries of Title I
evaluation studies, and (3) a synthesis of the substantive findings about
Title I that have been provided by those studies. Allotment of time to the
three documents was a constant problem during the nine-month period of the
contract, because the direct expenses covered only about 10 person-months
of professional effort. This document represents about 40% of that effort.

The authors intended to produce a document that would serve as the first
draft of an introductory textbook for educational evaluation. While this
goal would, we feel, fulfill a real need, it has proven more difficult than
expected to explain complex problems simply, and we would welcome any
readers' suggestions on how to do that better. The current document
contains very little algebraic notation; however, lapses into undefined
technical jargon can be frustrating to readers who are completely
unfamiliar with the subject matter, and we cannot guarantee to have
eliminated all such lapses.

The first draft manuscript for this document was produced in February
1977 to delimit the scope of the task. It was circulated among the authors
and to the NIE project monitor, Alison Wolf, whose comments on this draft
were very helpful. A second draft was produced in March 1977 and circulated
to several reviewers. These reviewers were exceptional in their donation of
time to this endeavor and the sophistication of the feedback they provided.
They were Michael Wargo, now the Director of the Evaluation Division of the
Office of Policy and Planning in ACTION; G. Kasten Tallmadge of RMC Research
Corporation; Jane David of the Educational Policy Research Center at the
Stanford Research Institute; Alison Wolf and Joy Frechtling of NIE; and from
the staff of AIR, William Clemans, William Shanner, and Marion Shaycoft. We
are deeply grateful to these individuals for their efforts; however, because
we did not follow their counsel in every case, we accept responsibility for
any faults that remain in this document.

The third draft of this document, essentially that which is presented
here, was completed in early July 1977 and was sent to NIE for final
approval prior to printing. The authors wish to express their special
thanks to Alison Wolf for her understanding of the budgetary and temporal
constraints involved in producing this document as well as for her cogent
advice on improving the content and format of the document.

Finally, we are very grateful for the exceptional efforts of Ms. Emily
Campbell, who gracefully accepted our missed interim deadlines and
efficiently turned our manuscript into a presentable document through her
typing/editing expertise.



Donald H. McLaughlin
Kevin J. Gilmartin
Robert J. Rossi


CONTROVERSIES IN THE EVALUATION OF COMPENSATORY EDUCATION



Introduction 



Since the middle 60s, many billions of dollars have been allocated
through the federal government to social action programs, and many millions
have been spent on the evaluation of these programs. In particular, the 15
billion dollars that have been spent on compensatory education through
Title I of the Elementary and Secondary Education Act of 1965 have been
accompanied by a continual stream of evaluation efforts. As has been
pointed out by several authors, program evaluation is in a sense an
adversary of program operation, and throughout the last decade there has
been a great deal of criticism of programs by evaluations and also
criticism of the evaluations by proponents of the programs. In this
document, we would like to set forth the critical issues in the evaluation
of compensatory education and attempt to supply the reader with an
understanding of the complexities involved so that he or she can judge how
and why to do evaluations as well as the validity of others' evaluations.

The crucial issue, as set forth in the classic argument between Donald
Campbell and John Evans (Campbell and Erlebacher, 1970; Evans, 1970), is
whether evaluations should be done perfectly or not at all, or should be
done as well as possible in each situation. On the one hand, federal
programs will inevitably be evaluated during congressional subcommittee
presentations, whether based on quantitative data or on anecdotes, so it
seems prudent to provide as much valid, objective, representative
information as possible to our policy decision-makers. On the other hand,
evaluation carried out by credentialed scientific organizations and
academic institutions carries some weight thereby and correspondingly
reflects on their reputations, and providing the stamp of scientific
integrity to a compromised evaluation may result in the end in the debasing
of the scientific method. If an evaluation must itself be evaluated before
acceptance (Scriven, 1976), the resulting infinite regression ensures the
lack of value of evaluation as a tool in policy-making. Evaluations must be
carried out by proficient investigators with proper objectives, and the
audience of the reports must be sufficiently aware of the issues in
evaluation to judge for themselves that the evaluations are performed
acceptably.



Ten issues dealing with the evaluation of Title I have been identified
within a general theoretical framework of evaluation. For each issue, it is
the aim of this document (1) to clarify the issue, (2) to point out examples
in which it is crucial, (3) to present and evaluate arguments on different
sides of the issue, and (4) to suggest resolutions of the issue. Each of the
issues was selected because its resolution is a necessary step in the
development of a rational Title I evaluation policy.

This document discusses evaluation in the context of decision-making
within a rational planning system. As shown in Figure 1, the system has four
primary components: decisions, rationales, information, and gathering (of
information). Decisions have rationales, and information is in turn gathered
to test and validate these rationales. The term "evaluation" can refer to
either the total system or subsets of it, although it is usually limited to
the gathering of information. The decision-theoretic approach (Edwards,
Guttentag, and Snapper, 1975) is the clearest example of the widest scope of
evaluation in this framework. From that viewpoint, the task of a program
evaluation includes, among other things, the analysis of the decision
process, and in particular, the quantitative determination of values that
affect decisions. Strict adherence to this framework would exclude from
consideration research studies whose product is not related to decisions
(e.g., basic research to determine the nature of educational disadvantage)
and studies called "evaluations" but undertaken for extraneous purposes
(discussed by Floden and Weiner, 1976). However, that fact will not preclude
the discussion of such studies as they relate to Title I in this document.

The separation of the four primary components of rational planning is an
important step in the identification of different types of evaluations.
Evaluations can be characterized by the types of decisions to be made, the
types of rationales adduced for them, the types of information relevant to
testing and validating the rationales, and the ways of gathering the
information. The most notable distinction of evaluations in terms of
decision type is the formative-summative dichotomy (Scriven, 1967);
according to that dichotomy, information gathered in evaluation can be used
either to improve the process evaluated (formative evaluation) or to support
a decision of whether to make further investment in the process evaluated
(summative evaluation). That distinction affects all components to the
extent that the type of decision determines the rationales, the information
needed, and the appropriateness of ways of gathering information.



DECISIONS
   are functions of
      attributes of the decision-makers
      availability of alternative choices
      rationales

RATIONALES
   are based on the relations of choices to outcomes and of
   outcomes to values. They require information.

INFORMATION
   about a program can be of 4 types:
      context (needs, disposing conditions)
      inputs (funds, regulations)
      processes (service delivery)
      products (outcomes, impact)
   Information must be gathered.

INFORMATION GATHERING
   consists of 4 phases:
      design (operationalization of rationale)
      sampling (ensuring generalizability)
      measurement (ensuring relevance, validity, reliability)
      analysis (transition from data to test of rationales)

Figure 1. Schematic diagram of the framework of evaluation.



A plausible rationale for any decision must take the form of an argument
that the value of the expected outcome given one choice is greater than the
value of expected outcomes given other choices. Independently of whether the
link drawn between a decision and later outcome is correct, there can be
substantial disagreement about which aspects of outcomes are to be
considered. A good deal of controversy over evaluation stems from this fact.
The need for information gathering arises when a rationale contains an
empirically testable statement whose truth is in question (e.g., statements
like "if we can get the moneys translated into smaller student:teacher
ratios, achievement gain will grow").
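
The form of such a rationale can be made concrete with a small numerical
sketch. The example below is illustrative only: the two choices, the outcome
probabilities, and the values attached to outcomes are hypothetical numbers
invented for the sketch, not figures from any Title I study, and the function
name is likewise an assumption. It computes the expected value of each choice
and reports which choice a rationale of this form would favor.

```python
# Hypothetical sketch: a rationale as a comparison of expected outcome values.
# All probabilities and values below are invented for illustration.

def expected_value(outcomes):
    """Return the probability-weighted value of (probability, value) pairs."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)

# Each choice maps to its possible outcomes: (probability, value to the
# decision-maker). The numbers are placeholders, not empirical estimates.
choices = {
    "increase funding": [(0.6, 10.0), (0.4, 2.0)],
    "hold funding level": [(0.5, 6.0), (0.5, 4.0)],
}

values = {name: expected_value(outs) for name, outs in choices.items()}
preferred = max(values, key=values.get)

for name, ev in values.items():
    print(f"{name}: expected value {ev:.1f}")
print("rationale favors:", preferred)
```

In this sketch, disagreement about which aspects of outcomes count shows up
as disagreement about the value assigned to each outcome, while information
gathering bears on the probabilities linking choices to outcomes.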



The four types of information shown in Figure 1 correspond to the four
types of evaluation identified by Stufflebeam (1971) and referred to
frequently as the CIPP model of evaluation. Information relevant to a
particular decision rationale may pertain to a program's context (e.g., the
needs and abilities of the target group), to its inputs (e.g., the funding
pattern and regulations), to its processes (e.g., the selection of
participants and of treatment methods and the implementation of treatments),
and to its products, or outcomes. The products of a program to be evaluated
can be expected to vary along a proximal-distal continuum: proximal outcomes
tend to be more under the control of the program to affect and less subject
to contextual factors, whereas distal outcomes tend to be more clearly
related to values which programs are hoped to achieve. Stufflebeam pointed
out the ways in which each of the four types of information is especially
important for a particular decision type; of course, the four types of
information are useful in combination for many decisions.

The four aspects of gathering information form the methodological
substance of most evaluations of federal programs, as reported by the
researchers who carried out the studies. The methodological issues to be
discussed in the present document will be presented in four sections
corresponding to these aspects of information gathering. Design issues refer
to problems in the general plans for testing of decision rationales. In many
actual cases, the rationales to be tested have not been made explicit and
can only be inferred from the nature of the report's conclusions. Sampling
issues refer to problems in generalizing to a population, and they concern
the size, representativeness, and units of the sample. Measurement issues
refer to problems in translating fundamental program concepts (e.g.,
"educational disadvantage") into instruments to assess the concepts and to
problems in assignment of numerical scores to recorded behaviors (scaling).
Finally, analysis issues refer to problems in isolating and explaining
particular relations in the data.

Before launching into the discussions of methodological issues, we shall
provide some context by expanding the general evaluation framework of
Figure 1 as it applies to federal studies of compensatory education. Each of
the issues in the four areas will be discussed abstractly, as it pertains to
any potential evaluations of Title I, and it will also be discussed in terms
of specific past evaluations for which it is relevant. The inclusion of
particular projects as examples in the discussions will take the projects
out of context, however, and readers should not consider these discussions
to constitute evaluations of the projects. Finding that a project has some
methodological weakness may not diminish the importance of many of its
conclusions, especially for studies that address many different aspects of
Title I.

Decisions and Rationales in Title I

The primary decision-makers in the Title I system are Congress, the U.S.
Office of Education (USOE), local school administrators, and teachers.
Although each of these groups is far from monolithic and makes numerous
diverse decisions that affect the operation of Title I, it is helpful to lay
out nine of the major decision types they address and the rationales and
information needs for them: four decisions by Congress, three by USOE, and
one each by local administrators and teachers.

1. Congress decides whether to increase the appropriation level. There
are at least three basic rationales for increasing funds: (a) the
program is reaching only some of the intended target population,
it is helping those reached, and the reason it is not reaching
others is because there are too few funds; (b) an effective method
for solution has been found, but its typical per-pupil cost of
implementation is higher than the typical expenditure allotted to
each participant; or (c) increased costs for the same services
require increased expenditures. Although the reason any particular
member of Congress votes to increase Title I appropriations is a
complex function of competing forces that may involve decisions on
other appropriations completely unrelated to compensatory education,
any decision to increase Title I funding must be accompanied by a
rationale such as those listed; otherwise, it can be attacked as
irrational or as an instance of "boondoggling."

Even though we cannot hope to compile a complete set of rationales
here, those that are included serve to identify the types of
information needed. For the first rationale to be useful, Congress must
know who the target population is and what the discrepancies are
between the target population (educationally and economically
disadvantaged children) and the participant population. They must also
know whether Title I is helping those it reaches. For the second
rationale, Congress must know of methods found to be effective and
capable of being widely utilized, their costs, and typical per-pupil
allocations. For the third rationale,* Congress must know
how inflation contributes to the costs of compensatory education
services and what the effects would be of "holding the line on
spending."

A comprehensive evaluation of Title I would aim to provide Congress
with the information necessary to test the validity of the various
rationales. Due to constraints of time and effort, however,
evaluations normally provide only partial validation of rationales,
which, although it is useful, leaves significant gaps to be filled
by faith. An example related to the first rationale above would
be a study that demonstrated that substantial numbers of
disadvantaged children were not being served, but failed to demonstrate
that the children who were served benefited from the service.
Whenever it is infeasible to close the informational gaps completely,
an evaluation will be most useful when it addresses the gaps with
whatever information is available.

In discussing this first decision type, we have tried to explain
some of the problems that arise in relating decision-making to
information-gathering. These apply also to the remaining eight
decision types, although they will not be presented in equal detail.

2. Congress decides whether to decrease the appropriation level. The
rationales for decreasing spending are not merely the inverses of
the rationales for increasing spending. Two rationales for this
decision might be (a) that funds are being used for services for
people other than disadvantaged children or (b) that the need for
a federal compensatory education program has diminished. Another
possible rationale, that although the need persists the program is
not dealing with it, is an argument for changing the program, not
reducing its funding level. To test the two rationales for
decreasing funding, the necessary information includes the distribution
of compensatory education needs and services throughout the country.

*This third rationale is relatively weak, because it can be applied to all
appropriations.



3. Congress might modify the funding allocation formula. The rationale
for this decision might be either (a) that the children served by
Title I are not exactly those for whom the program was intended or
(b) that the nature of the need served by the program is modified.
Again, the necessary information concerns the distribution of
compensatory education needs and services, but possibly with emphasis
on variations among needs and services.

4. Congress might modify or add a rule concerning the use of program
funds. The rationale for this decision would be the identification
of a problem that reduces the effectiveness of the program and a
general method for eliminating or reducing the frequency of that
problem. The information needed for this type of decision is
therefore evidence that a particular unintended process frequently occurs
in implementation of the program and that this process reduces
program effectiveness. The latter type of evidence is necessary in
order to avoid eliminating effective processes, and its validity
depends upon the demonstration of causal linkages, not merely
correlations: it is quite likely that the program will be more
effective in some situations than in others but the situation is not the
cause of the effectiveness. Another type of evidence, that a
particular modification to the law will deal effectively with the
problem, is unlikely to be available before the modification is made,
but can be obtained after the modification by comparing the
prevalence of the problem before and after the modification. Further
modification can then be made.

Turning now to the U.S. Office of Education, we have three more major
decision situations. One of these is essentially the same as the
congressional decision to modify or add a rule.

5. USOE might modify or add a rule concerning the use of program funds.
That rule might be in the form of a regulation (with the status of
a legal requirement) or a guideline (a formal suggestion for
procedures). The rationale and evidence necessary for such a decision
would be the same as for the analogous congressional decision.

6. USOE may decide to disapprove a state's application for its annual
allotment of funds or to request return of funds. The rationale
for this decision would be that the state is not complying with the
law and regulations, that its noncompliance reduces the
effectiveness of the program, and that punitive action would be likely to
improve program performance. Evidence needed for this rationale
concerns processes and outcomes within particular states and local
districts, rather than national averages. It also concerns whether
punitive action will deal with the problem, which, except for
generalization from other federal programs, can only be determined
after the action is taken.

7. USOE may decide to provide technical assistance. In fact, that
decision may be incorporated into the law by Congress, as in the
case of the instructions to USOE to provide technical assistance
to states and local districts in the preparation of their annual
Title I evaluation reports. The rationale for such a decision is
that there is a clear and pervasive problem that cannot be dealt
with through regulations, because states and local districts do
not have the capability for solving the problem. In the case of
annual evaluation reports, a substantial part of the problem is
that data are presented in such varied forms that aggregation across
states to form a national program assessment has been impossible;
technical assistance has aimed to promote uniformity of reporting,
among other things.

The information needed in order to implement a technical assistance
program includes not only evidence of a problem but also
information concerning proper methods for carrying out processes, and this
information need requires research and development efforts that go
beyond the usual type of evaluative information gathering. The
area in which there is greatest need for technical assistance within
Title I is the specification of effective methods for compensatory
instruction, and in order to provide this assistance USOE has
undertaken, among other things, to discover effective methods.

The many decisions involving actual delivery of compensatory education
are made at the local level. The participation of state education agencies
in the decision-making process varies greatly among the states and
contributes to the local decision-making effort.



8. Local school administrators decide upon particular expenditures of
Title I funds. The rationale for a choice among alternative projects
would include information concerning which methods will generate the
greatest reduction in educational disadvantage in the context of the
local schools. Two forms of this information are (a) the results
of careful research on compensatory education coupled with knowledge
about the effects of the special context of the local district on
compensatory education effectiveness or (b) a finding that the methods
used previously in the district's schools were satisfactory
according to local standards. It is the purpose of local evaluations to
provide the latter type of information; the general lack of valid
evidence of effectiveness of locally developed methods provides the
justification for technical assistance from the federal government
in the form of disseminating information about effective methods.

9. The teacher of a compensatory education participant, besides
designating him/her for participation, makes day-to-day decisions on the
form and content of compensatory instruction that for the child are
at least as important as any other decisions made in the system.
Although these decisions have their rationales, the rationales are
most frequently not clearly understood. It is an objective of
curriculum packages to provide the decision rules (for example, in
individualized instruction) that will enhance the child's
achievement. Those decision rules are (ideally) the result of validation
of rationales based on student performance during the development
of the curriculum packages.

Although many other decisions might be included, these nine provide a
basis for the specification of the primary information needs for Title I
decision-making. Although it is the purpose of evaluation, generally, to
meet these information needs, any particular evaluation project will meet
only one or a few of the needs. An overall strategy is needed that would
meet all the needs efficiently. The collection of information need not be
related to decisions in a one-to-one fashion; not only are many decisions
made simultaneously or in overlapping time periods, but certain types of
information call for similar evaluation paradigms, some for different
paradigms.

Information Required in Title I Evaluations

At the inception of Title I, information needs had not been clearly
differentiated, and information gathering efforts designed to satisfy
imprecise forms of all information needs at once were undertaken. As
reported by Zimiles (1970), Wargo, Tallmadge, Michaels, Lipe, and Morris
(1972), and McLaughlin (1975), the first five years of evaluation of Title I
were essentially a total loss in terms of achieving any of the valid
objectives for evaluation. In recent years, there has been greater
differentiation of roles and objectives within the federal educational
evaluation bureaucracy, and efforts such as the Descriptive Study of
Compensatory Reading Programs (Trismen, Waller, and Wilder, 1975), the PIPs
dissemination strategy (Stearns, 1977), the technical assistance centers and
evaluation packages to help states and local districts carry out
evaluations (Wood, Cannara, Fagan, and Tallmadge, 1976), and currently
ongoing efforts funded through the Office of Education (the Sustaining
Effects Study, System Development Corporation, 1976) and the National
Institute of Education (the overall Title I assessment, National Institute
of Education, 1976) are evidence of movement towards more realistic
relations between objectives and operations in evaluations.

There are seven basic categories of information needed to test the
rationales listed above. Various combinations of two and three categories
of information, when properly analyzed, yield the required tests. The
relations of the seven categories and their combinations to the rationales
are shown in Table 1. Information on target and participant populations and
on costs is "context" information; information on resource allocation is
"input" information; information on management and services is "process"
information; and information on effectiveness is "product" information.
While the reader may disagree with some of the specific entries in this
table, the important point is that such a relational table is a proper
foundation for the development of a comprehensive evaluation strategy.
Understanding how the information is to be used provides an important input
to choices of ways of gathering the information (e.g., what populations the
sample must represent and what particular details of information should be
included in measurement instruments).
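
One way to see how such a relational table supports planning is to treat it
as a mapping from combinations of information categories to the rationales
they help test, and then to invert it to ask what information a given
rationale requires. The sketch below is an illustration under stated
assumptions: it encodes only a handful of Table 1 entries, and the data
structure and function name are hypothetical conveniences rather than part
of the evaluation framework itself.

```python
# Hypothetical sketch: a partial encoding of Table 1 as a lookup structure.
# Only a subset of the table's entries is included; rationale labels such as
# "3a" mean Rationale a for Decision 3, following the table's footnote.

TABLE_1_SUBSET = {
    ("Target Population",): ["2b", "3b"],
    ("Target Population", "Participant Population"): ["2a", "3a"],
    ("Target Population", "Costs"): ["1a", "3a"],
    ("Resource Allocation Process",): ["3a", "6"],
    ("Costs", "Effectiveness"): ["1b", "4", "5"],
    ("Effectiveness",): ["8"],
}

def information_needed(rationale):
    """Return the category combinations listed as needed to test a rationale."""
    return [combo for combo, rationales in TABLE_1_SUBSET.items()
            if rationale in rationales]

# Example: which combinations of information bear on rationale 3a?
for combo in information_needed("3a"):
    print(" and ".join(combo))
```

Reading the table in this direction makes explicit which populations and
measures a study must cover before it can speak to a particular decision.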

Table 1

Categories of Information Required of Title I Evaluations

                                                              Needed to Test
Category of Information                                       Rationale*

1. Target Population (level and frequency of needs;
   other characteristics)                                     2b, 3b
     and Participant Population                               2a, 3a
     and Participant Population and Allocation Process        3a
     and Costs                                                1a, 3a
     and Effectiveness                                        [illegible]
2. Participant Population (numbers, per-pupil allocations,
   other characteristics)
     and Services and Effectiveness                           7
     and Costs and Effectiveness                              1[illegible]
3. Resource Allocation Process (selection of participants)    3a, 6
4. Local School Management Process (parental involvement,
   evaluation, project design)                                [illegible]
     and Effectiveness                                        4, 5, 6
5. Services (processes, agents, contents, settings)
     and Costs                                                1a, 8
     and Effectiveness                                        4, 5, 7, 8, 9
     and Costs and Effectiveness                              7, [illegible]
6. Costs (resources needed for delivering compensatory
   education)
     and Effectiveness                                        1b, 4, 5
7. Effectiveness (changes in pupils' school performance)      8

*The rationales are numbered to match the presentation in the text. For
example, "2b" refers to Rationale b for Decision 2.

In order to provide this foundation, it is necessary to address four
"systemic" questions, questions that concern the principles of the system's
operation. These may be addressed either as part of an evaluation project
or as a precondition to the design of evaluation projects. These four
questions are:

1. What operations are intended to occur in the Title I system and
how do they interrelate? In order that information gathered be
relevant to decision-making, there must be a clear understanding
of how the system is supposed to function in greater detail than
expressed in the law. For example, the meaning of "economic
disadvantage," "evaluation," and "supplementary services" must be
translated into specific observable events, if empirical
observations are to be related to the program's principles.

2. What assumptions about society and human behavior are incorporated
into the Title I system? For example, there would appear to be an
assumption that economic disadvantage is a source of problems in
schools that money can remedy. There also appears to be an
assumption that children, once brought up to the ability levels of their
classmates, will benefit from regular school instruction as much
as their peers. Such assumptions must be separated from hypotheses
about process-effectiveness in order that evaluation outcomes can
be interpreted appropriately. In other words, the testing of
rationales for decision-making should be undertaken with a clear
awareness of the presuppositions inherent in those rationales.

3. What are the objectives of the program? For example, there needs
to be a clarification of the types of impact on students that are
to be considered as justifying Title I expenditures. Do these
include cognitive skills beyond reading? Do they include attitudes
and self-concepts? Do they include the physical well-being of the
student? As another example, there needs to be clarification of
the intended impact of Title I on the administration of local
school districts. Should it include generally greater emphasis
on promoting equality among all students or greater emphasis on
evaluation and planning in school programs or more careful
diagnosis of individual students' special needs? Also, to what extent
is the objective of the program the mere transfer of funds to
impoverished school districts? Mistaken assumptions about a
program's real objectives will lead to useless recommendations; an
evaluator who understands those objectives can provide a more
profound interpretation of his/her data.

4. What are the relative values of different program outcomes? This
is a refinement and quantification of the preceding question. Not
only does the range of objectives need to be identified, but also
there must be some estimate of the relative importance of different
outcomes. For example, to decide what percentage of a district's
Title I funds should be spent on students in grades 1 through 3,
it is useful to have some estimate of the value of compensatory
education for children of different ages, based on a comprehensive
theory of education. Likewise, for an evaluator to compare the
benefits of different projects that achieve different goals, a
quantitative measure of those achievements is necessary.

An argument against including systemic questions in an evaluation
framework is that they are beyond the province of evaluators and are to be
decided through political negotiation, logic, and common sense. As the work of
many psychologists has shown, however, these processes are themselves subject to
principles of human behavior that can be studied and improved upon. The use
of scaling techniques to arrive at consensus values and the use of the
research literature on social processes and human learning to identify the
assumptions in the system are two instances in which systemic studies might
well supplement the often bias-laden human processes such as political
negotiation. Edwards, Guttentag, and Snapper (1975) have elaborated specific
methods for dealing with some systemic questions in evaluation.

The general arguments for including systemic questions in the evaluation
framework are (1) that otherwise they quite likely are not answered and the
meaningfulness of the tests of decision rationales is therefore severely
reduced, and (2) that answers to systemic questions are more likely to
represent the views of society at large if arrived at through systematic,
replicable (i.e., scientific) methods.

Specific arguments can be made against forcing an evaluation to
characterize the system in terms of a single set of objectives and outcome
values. For one thing, the resulting set would oversimplify the situation. An
important aspect of Title I is the multiplicity of goals of the program as
viewed by citizens in different situations. By not delineating the operational
objectives of the program precisely, Congress can forge a coalition of
constituencies in favor of the program that might collapse if all the
objectives were well specified. Any method of establishing objectives and values
must not have the effect of forcing a collapse of the coalition. This is not an
insurmountable barrier to rational decision-making, however. Systematically
establishing the value dimensions for various outcomes of Title I would
provide a much needed foundation for addressing fundamental evaluation issues,
such as how to scale and aggregate achievement gains.

Information Gathering Processes

The aspect of rational decision-making most frequently referred to as
"evaluation" is the gathering of information, or as it has recently been
called, "the production of knowledge." For most evaluation specialists, the
area of their training and technological expertise is in gathering reliable
and valid information, and the choice among evaluators for a particular
project usually depends on the demonstration of that expertise. Although
evaluators are wise to be aware of the points discussed in the preceding section
(i.e., how the information they gather is to be used), their primary
responsibility is for gathering the information. In keeping with this concept of
evaluation, the methodological issues discussed in this document relate
primarily to information gathering, although the context of the information
gathering will be seen to modulate the issues. (One general heuristic for
this is that the evaluation of information gathering, like any other activity,
should take into consideration the objectives of that activity. Another is
that whenever you find you cannot gather a particular type of information,
you should ask whether you really need it.)

As set forth in Figure 1, we can view information gathering as
consisting of four components: design, sampling, measurement, and analysis. The
issues addressed in the subsequent sections are, in fact, grouped according
to these categories.

The first of the components, design, is the most difficult to delimit.
It is the planning process, the development or selection of a framework for
information gathering. Thus, it overlaps the other three components: the
detailed specification of the sampling, measurement, and analysis components
would in fact include the total content of the design of information
gathering. There are, however, three design factors that transcend the other
components: (1) the general frame of reference, (2) the specific design
model, and (3) the longitudinality of measurement. Much of the controversy
concerning Title I evaluations has centered on the specific design models
used. In particular, comparisons of performance of nonequivalent groups
have proven faulty. The first design issue to be discussed will consider
a possible resolution of that controversy by way of changing the evaluation
frame of reference. The other design issue to be discussed pertains to the
validity of evaluations based on gains within a single school year, a
question of longitudinality.

The second component, sampling, refers to the specification of rules for
selecting which states, districts, schools, projects, classrooms, or children
to collect data from in order to reach general conclusions. Of the two
sampling issues discussed, the first will focus on the impact of having
nonrepresentative samples on the validity of the information provided, and the
second will focus on the necessary size of samples to be used in evaluation.

The third component, measurement, has also been a center of controversy.
There are three factors in the specification of measurements in evaluations:
(1) the selection of which constructs to measure, (2) the selection, or
development, of instruments (e.g., achievement tests) to make the
measurements, and (3) the scoring, or scaling, of responses on the measuring
instrument. Pertaining to the first factor are issues of how general the
achievement gains are to be. These issues border on substantive issues of what the
objectives of Title I should be; however, they also involve methodological
issues of how to measure cognitive growth. Pertaining to the second factor
is the issue of the role of criterion- and norm-referenced tests in evaluation
of compensatory education; and pertaining to the third factor is the
pervasive issue of the units of measurement, in particular, the role of
grade-equivalent scores.

The fourth component of information gathering is analysis. Analysis
has as its purpose the transformation of measures of sampled individuals
into information relevant to rationales for decision-making. More
particularly, the results of analysis are assignments of the likely truth of
particular statements that contribute to rationales (e.g., "the likelihood that
children would have learned this much in the absence of the program is less
than 1 in 100"). The most salient issues concerning analysis are (1) whether
there are adequate analytical methods for comparison between nonequivalent
groups, and in particular, whether variants of analysis of covariance are
appropriate; (2) whether analyses used to determine relations between
effectiveness and costs and services are appropriate; and (3) whether methods
used to aggregate data from different sources (e.g., annual state reports)
are appropriate.

Realities of Evaluation

An important characteristic of evaluation is that it is a process
involving people, and therefore the assumption that in reality evaluation
conforms to some simple model, such as is presented here, will miss a large
part of the true nature of evaluation. First, the purpose for which
information is gathered does not directly affect whether it is reliable or valid;
and large numbers of studies in the research literature are subject to the
same methodological criticisms that are directed at federal evaluations of
compensatory education. There are, however, two fundamental reasons why
methodological problems appear to be more prevalent in the federal
evaluation studies than elsewhere: (1) the results of the studies are of
substantial importance to the lives of many people, so they are subjected to more
intense scrutiny than are less sensitive research projects; and (2)
requirements and constraints on information gathering are to a great extent
specified by individuals with expertise in the use of information in decision
rationales but not in the process of gathering information, and as a result
the information gathering designs allowed are often limited to those of
questionable validity (e.g., quasi-experimental designs; see Campbell and
Stanley, 1963, and Campbell and Boruch, 1975). Only when policy-makers are
aware of the alternatives for reliable and valid information gathering and
of the consequences of basing rationales on less than adequate information
can there be adequate evaluation.

One particular way in which evaluations of federal educational programs
are limited is in their effects on the schools in which they collect data.
Teachers and local school administrators naturally evaluate the goals of
national evaluations as of secondary importance to the main task of teaching
their students; and because the operations of information gathering do
conflict with normal classroom activities, compromises must be made in order
for any information to be gathered. One direction for creative solutions
to methodological problems in evaluation may be in the negotiation of new
forms of compromise between educators and evaluators.

Finally, before turning to the methodological issues, we should point
out that the decision-oriented framework for evaluation that has been
presented is not the only framework for evaluation. As pointed out by Floden
and Weiner (197^), evaluations frequently have non-decision-making goals
that complicate the identification of the information needed. "Evaluations"
may be undertaken as a means of stimulating a project to take action, or as
a form of public relations, or as a way of justifying decisions already made,
or as a strategy in the development of an organizational power structure.
While these activities are in a sense demeaning for evaluators who take
pride in their information gathering craft, they nevertheless provide
opportunities for the practice and enhancement of their craft. The methodological
issues to be discussed are relevant to these activities also, to the extent
that the information gathered might also be useful in future decisions; and,
moreover, they are likely to be especially difficult to deal with because
the individuals allocating resources to the "evaluation" are not motivated
primarily to obtain reliable and valid information.

Evaluation of federal programs such as Title I has become a large-scale
activity. Limitations on allowable information gathering continue to plague
evaluators, however. This document will attempt to clarify the effects those
limitations have on the validity of information gathered and to suggest
potential directions for searching for solutions. Evaluation is viewed here as
the gathering of information to test rationales for decisions, although the
issues to be addressed are also relevant to other information gathering, or
"knowledge production," efforts. The information needs for Title I evaluation
can be generally derived from consideration of the types of decisions
Congress, USOE, local school administrators, and teachers must make, and they
fall into seven categories: target population, participant population,
resource allocation process, local school management process, services, costs,
and effectiveness. In order to relate such information to decision-making,
on the other hand, it is necessary to answer several systemic questions about
the Title I system, either as a precondition or as a part of evaluation.
Information gathering can be divided into four categories: design, sampling,
measurement, and analysis; these categories provide the organization of the
rest of this document. However, the issues to be discussed will involve
political realities of evaluation that transcend those four categories.




Design 




Introduction 



The design of information gathering is a dangerous topic for
discussion. Many policy-makers view the technical aspects of it to be
details that technicians can carry out, and they are concerned only
with more global design issues; and many technicians view these technical
details as the entirety of the design problem and fail to consider the more
global design issues. Neither approach is satisfactory, however; the
global issues and the technical details are actually closely interrelated.
The best solution to a technical problem may be a change in the global design
rather than an increase in the sophistication of techniques. Such a
solution is proposed in the first issue to be considered in this section
(the issue of the role of "control groups" in Title I evaluations).
Although that issue has been considered by many researchers as a specific
design procedure requiring further methodological development, a promising
avenue for resolving the issue may lie in changing the frame of reference
for evaluation. Policy-makers must listen to the expert advice of
researchers and call upon other researchers to question and refine
evaluation designs, if evaluations are to make use of the recent
developments in methodology. Especially in the case of the first issue, the
methodological sophistication in the research community in 1977 is
significantly greater than a decade, or even five years, ago. Methods
long accepted throughout the research community have been found
questionable, at least as they apply to evaluations of compensatory education.

The basic design problem, which has long been noted by philosophers
of science (e.g., Eddington, 1958), is that all scientific observation
and interpretation is performed in the context of a theoretical
framework. Acceptance of the information thus gathered often depends on
the acceptance of the framework. In particular, the use of statistical
methods, such as estimation of effects in the population from effects
observed on a sample, is predicated on sets of assumptions that are
rarely tested in evaluation studies. One reason for this is that
statisticians have demonstrated that many of the most common methods
are quite "robust" with respect to some of their assumptions; that is,
the methods would produce valid results even when the assumptions were
not quite true. For example, the common t-test* assumes that random
errors are normally distributed, but the test is quite valid even for
significant departures from normality.

In the case of compensatory education, a recurring issue has concerned
the validity of comparing achievement gains of Title I participants with
the gains of a control group or of a standardized population. Methods for
comparison of two groups in a psychological experiment are based on
assumptions likely to hold true in the laboratory, but violated in the
conduct of uncontrolled studies of ongoing programs in the field. One
direction of resolution of these problems has been the "improvement" of
statistical methods so that they involve fewer assumptions to be tested.
That process is incremental; however, it is more costly than is generally
recognized, and quite often it has taken the form of replacing one set
of assumptions to be tested with another. The last deserves comment:
it certainly is an advance to have two analytical methods that work
for two different sets of assumptions rather than a single method;
however, in practice having two frameworks requires the collection of
extra information to test which framework is appropriate, which increases
the overt cost of an evaluation. Plans for evaluations should explicitly
include the assumptions underlying the observation and interpretation
process and insofar as possible include plans for testing the assumptions.



*Student's t-test is a method for testing whether one group's scores are
generally higher than another's. To carry out the test, one divides the
difference between the group means by an estimate of how variable the
scores are within each group. The larger the quotient, the more
statistically significant is the difference in the groups' scores. The
aim of this monograph is not to serve as a statistical text, so particular
statistical methods will only be described in sufficient detail to permit
non-statistically trained readers to follow the discussion. The basic
concept involved is that the truth of a statement is a function of the
relative likelihood of obtaining a particular set of scores if the
statement were true or were false. In the case of the t-test, it is
the likelihood of obtaining a particular difference between groups if
the difference were real or merely a chance occurrence.
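
The quotient described in this footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the pooled-variance (equal-variance) form of the two-sample statistic, the function name is hypothetical, and the scores are invented for the example rather than taken from any Title I study.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Difference between group means divided by its pooled standard error."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled estimate of how variable scores are within each group
    # (assumes both groups share the same underlying variance).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical test scores, invented for illustration only.
title_i_scores = [52, 55, 49, 58, 61, 54]
comparison_scores = [50, 47, 53, 49, 51, 48]

t_value = pooled_t_statistic(title_i_scores, comparison_scores)
print(f"t = {t_value:.2f}")
```

A larger absolute value of the quotient corresponds to a difference that is less likely to be a chance occurrence; judging significance would still require comparing it with the t distribution for the appropriate degrees of freedom, which this sketch does not do.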
In addition to the general question of what type of comparison should
be used in Title I evaluations, there is another question of frame of
reference that must be considered in planning an evaluation: What process
is to be evaluated? Although, from a strict program evaluation perspective,
it is the "Title I process" of allocating national resources to meet the
special educational needs of disadvantaged children, the testing of
rationales for decisions may depend more on information about other
processes, such as "compensatory education," however funded, or
"individualized instruction," or the relationship between economic and educational
disadvantage. A problem in trying to evaluate the Title I process per
se is the difficulty of separating those processes that have Title I as a
cause from other processes occurring in the same classroom. Although
superficially it appears that local school administrators are able to
allocate Title I funds to identifiable categories, classroom dynamics
preclude measurement of overall effects (e.g., if the effect of a
particular Title I project is to pull students out for special reading
instruction, the side-effects of this on the students remaining in the
regular classroom cannot be ignored in a comprehensive evaluation of
that project). Because the issue of what process is to be evaluated
is determined more by considerations of the use of information than by
problems in the gathering of information, it will not be considered
as a separate methodological issue in this presentation.

Questions concerning the specific experimental design for information
gathering have centered on the use of quasi-experimental designs to substitute
for randomized, or true experimental, designs. Design in this sense
refers to the operationalization of tests of decision rationales in
terms of numerical relations to be observed among measurements of
subjects (e.g., children) and the specification of subject selection in
a way that will make inferences from numerical relations to tests of
rationales meaningful and valid. There are dozens of common "experimental
designs" that evaluators can apply to the evaluation task, but each is
based on implicit assumptions that should be tested. It appears that
at present we may be in a position in which none of the known "experimental
designs" (including quasi-experimental designs) is both politically
acceptable and able to provide valid tests of important decision rationales
in Title I. This is discussed in Issue 1.



Questions concerning the longitudinality needed in Title I evaluation
have centered on the paradox that children in compensatory education
programs tend to learn about as fast as children who have no educational
disadvantage, when measured from fall to spring; yet from year to year
the children in compensatory education programs fall further and further
behind their peers. Various aspects of this question are discussed in
Issue 2 in this section.

The two issues discussed in this section by no means exhaust the
methodological questions to be addressed in designing an evaluation study.
They both focus on evaluations that aim to assess impact on students'
performance. In addition to sampling, measurement, and analysis issues
discussed in later sections, there are numerous "details" of procedure
to which the director of a large-scale program evaluation must attend,
such as staff management, scheduling, liaison with program and project
managers, data management, and report design. An evaluation is equally
susceptible to loss of credibility from carelessness in these aspects
as from statistical design problems.

Issue 1. To what should Title I treatments* be compared?

*A summary of the types of treatments funded under Title I is contained
in a companion volume (McLaughlin, 1977).

Evaluation is not mere description. Where description is substituted
for evaluation, important systemic questions about the program have
not been addressed. The testing of decision rationales through the
gathering of information always involves the interpretation of that
information as it relates to a hypothesized description of the program.
Thus, there must always be some comparison of the information on program
performance, in our present case the achievement of children who have
received Title I services, with a standard. Based on this comparison,
the validity of decision rationales can be tested, and recommendations
for policy can be developed. As pointed out by Stake (1967), there are
two fundamental types of standards for comparison: (1) comparison with
what would have occurred were the treatment not present, a relative
comparison; and (2) comparison with a goal-outcome that the treatment
was intended to produce, an absolute comparison.*

In the case of relative comparison, the estimation of what would
have occurred without the special treatment is the most significant
design problem; in the case of absolute comparison, the specification of
the goal-outcome desired is the most significant design problem.
Methodologies for estimating "what would have occurred" are noticeably further
developed than the (more complex) science of educational goal-setting
(e.g., the t-test is universally accepted, but goals for the schools
vary from one community to another). For that reason, it would seem,
compensatory education evaluations have been designed for relative
comparisons. The recent development of "criterion-referenced tests,"
"objective-referenced curricula," and "competency-based education"
(Spady, 1977) is, perhaps, a harbinger of a movement toward specification
of goal-outcomes for compensatory education, which would allow absolute
comparisons. Both types of comparison play a role in ideal program
development, as shown in Table 2. They are based on distinctly different
points of view, however. The type of question answered by a relative
comparison is "Did the program have an effect?", and the type of
question answered by an absolute comparison is "Did the program meet
the need?" A relative comparison will not tell us whether the problem
is being solved by the treatment, and an absolute comparison will not
tell us whether the level of expenditure is justified. A comprehensive
evaluation strategy would require both types of comparison.

In order to resolve this issue, it is necessary to weigh the costs
and benefits of the various alternatives for comparing Title I treatments.
We shall first consider the intricacies that have been discovered in using
relative comparisons and then examine the potential for the use of absolute
comparisons, which have received little attention in the ten years of
Title I evaluation.

*Lest the terms "relative" and "absolute" confuse the reader, it should
be noted that in a sense all comparisons are relative. The way these
terms are being used here refers to the dependency of the validity of the
decision rationale being tested on the operationalization of some
hypothetical model (e.g., what would have occurred in the absence of Title
I). A "relative" comparison is so dependent, and an "absolute" comparison
is not.
Table 2

Relationships between Outcomes
of Absolute and Relative Comparisons

                              Absolute Comparison
Relative
Comparison    Positive Results             Negative Results
----------    -------------------------    -------------------------
Positive      Conclusion:                  Conclusion:
Results       Program operation            Need for more effort
              satisfactory.                or reconsideration of
                                           goal-outcomes.

Negative      Conclusion:                  Conclusion:
Results       Program effort can be        Need for redirection
              decreased substantially      of program efforts and
              or goal-outcomes need        need for new methods.
              reconsideration.
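
As a compact restatement, the four cells of Table 2 can be written as a lookup keyed on the two comparison outcomes. The conclusions are transcribed from the table; the dictionary layout and the function name are illustrative assumptions and not part of the original report.

```python
# Conclusions transcribed from Table 2. The data structure and the function
# name are hypothetical, chosen only to illustrate how the two comparison
# outcomes jointly determine a conclusion.
TABLE_2 = {
    # (relative comparison positive?, absolute comparison positive?)
    (True, True): "Program operation satisfactory.",
    (True, False): "Need for more effort or reconsideration of goal-outcomes.",
    (False, True): (
        "Program effort can be decreased substantially "
        "or goal-outcomes need reconsideration."
    ),
    (False, False): (
        "Need for redirection of program efforts and need for new methods."
    ),
}


def table_2_conclusion(relative_positive: bool, absolute_positive: bool) -> str:
    """Return the Table 2 conclusion for one pair of comparison outcomes."""
    return TABLE_2[(relative_positive, absolute_positive)]


# Example: the program had an effect, but the need was not met.
print(table_2_conclusion(relative_positive=True, absolute_positive=False))
```

The lookup makes visible that neither comparison alone selects a conclusion: each row and each column of the table contains two different recommendations.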






Relative comparisons. For relative comparisons, the only method
known for estimating what the performance level would have been without
the treatment is to observe the performance level of some other group
not receiving the treatment.* The selection of that other group is crucial
to interpretation of the comparison. There are three categories of
alternatives: (1) random assignment of preselected subjects to treatment
and no-treatment (i.e., standard school treatment) conditions; (2)
selection of a comparison sample in any other way; and (3) use of norms tables
of estimated performance in the general population, published with
standardized tests. Random assignment is necessary for the true
experimental method; it involves the least threat to the internal validity
of evaluation but the greatest complexity in interaction with program
operation. Random assignment, it should be noted, can refer to assignment
of students within a classroom to treatment and control groups, to
assignment of schools to treatment and control conditions, or to assignment
of any other unit. The implications of randomization of different levels of
units are discussed under Issue 3. The only major federal education program
evaluations that have employed randomization are the ESAP and ESAA
evaluations (NORC, 1973; Coulson et al., 1976).
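
The point that random assignment can be applied to different units can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the identifiers are invented, the function name is hypothetical, and an even split into two halves is only one of many possible allocation rules.

```python
import random


def randomly_assign(units, seed=None):
    """Shuffle eligible units and split them into treatment and control halves."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical identifiers, invented for illustration only.
eligible_students = [f"student_{i:02d}" for i in range(1, 21)]
eligible_schools = ["school_A", "school_B", "school_C", "school_D"]

# The same procedure applied at two different levels of unit.
student_level = randomly_assign(eligible_students, seed=1977)
school_level = randomly_assign(eligible_schools, seed=1977)

print(len(student_level["treatment"]), len(student_level["control"]))
print(school_level)
```

The procedure is identical at either level; what changes is the number of independent units available for comparison, which is far smaller when whole schools are assigned.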

*A repeated measures design in which each child acts as his/her own
control is an interesting alternative, but would involve extremely
complex corrections because the goal of compensatory, and regular,
education is to change the child, and the rate of individual growth
varies in complex patterns from year to year.

Nonrandom comparison groups that have been used in major Title I
evaluations include (1) students in the same school in a prior year
(Mosbaek, 1968), (2) students whose classmates were participants in
Title I (Trismen et al., 1976), and (3) students without compensatory
education programs (Trismen et al., 1976). Nonrandom comparison groups
have been used in numerous local evaluations, and current efforts by
USOE to help local districts carry out Title I evaluations include this
method (Wood et al., 1976). The problem with nonrandom comparison
groups is that there is no assurance that they are comparable to the
treatment group prior to treatment. As we shall see, the ways in which they
can differ are numerous, and testing for all the possible differences
so one can make the correct adjustment of the comparison borders on
the infeasible. On the reverse side of the coin, there is, by the fact
that Title I is designed for a subset of the children in a school,
nearly always some comparison group nearby that can be inexpensively
tested to provide comparison data. One type of comparison data available
in many school districts is cumulative growth curves for children in
the district. Although subject to problems, these local norm data are
usually preferable to the use of national norms.

The use of national standardization data has much the same set of
problems as use of a nonrandom comparison group. The problems are
compounded by the fact that, unlike with a contemporary local comparison
group, one cannot observe what variety of experiences and traits
characterize the national comparison group. The problems associated with use
of the norms tables of standardized tests are discussed under Issue 6.
Nevertheless, such data, in the form of gains relative to typical
performance at a grade level (grade-equivalents), have been used by many
states for their annual Title I evaluation reports and thus by the
federal evaluators who aggregated the state reports. Of course, for the
local evaluator, use of norms tables is the least expensive method
for generating a comparison of a treatment group's performance.

In order to evaluate the usefulness of these three methods for
performing relative comparisons, we must consider the various costs
generated by each alternative. Four types of marginal costs must be
included:

1. costs of collecting the needed data for comparison;
2. costs of producing the data (incurred by teachers and students);
3. costs in lost validity and in resulting lost credibility of
the evaluation's findings, compared with other methods; and
4. costs for development of the method.

These costs offset each other, and they apply to absolute comparisons
as well as to relative comparisons; therefore, it is essential for an
evaluation designer to understand their differences and to be able to
compare their values. Because the credibility of the findings is
partially dependent on what the findings are (that is, whether or not
they conform to results desired by groups in a position to attack their
validity), policy-makers and evaluation designers must choose whether to
gamble with an "inexpensive" design and hope that results prove
noncontroversial or to be conservative and use an "expensive" design. A pilot
evaluation is a useful tool in this situation.

For random selection, marginal costs are nearly all in the category
of producing the data. The major cost, invariably given as the reason
for ruling out randomization, is the withholding of Title I benefits
from the unlucky needy students selected to be in the control group. The
law specifies that Title I funds are to be used to meet special educational
needs of the young people with the greatest needs, and program
administrators and teachers are reluctant to compromise that principle merely
for the purposes of valid evaluation. This is a constraint within which
evaluation must be carried out. Proponents of randomized designs must
find rationales for randomization that will meet the objections of
administrators and teachers.

Several such rationales have been suggested (e.g., Campbell and
Boruch, 1975). First, one might argue that there is little evidence
that missing out on the program for a year has lasting effects on one's
education; after all, "no Title I treatment" does not mean "no instruction."
Finding that local districts, teachers, and parents do not readily accept
this argument would indicate by itself that these people, at least, believed
the treatment to be effective.

A second design for randomization is conceivable when more than one
compensatory service is available, only one of which a student can
receive at a time. For example, if there are compensatory reading and
mathematics classes, then it might be reasonable to assign needy students
randomly first to one for a year and then to the other for a year. This
would be questionable if children normally were behind in only one of
the subjects: assigning a child with math difficulties to a compensatory
reading class might be counterproductive. The results, at least for
the first year, would be a randomized design in which each compensatory
group was the control for the other. This presumes that the content
of two instructions has little overlap (otherwise the evaluation would be
too stringent, pitting two compensatory classes against each other), a
presumption probably false in the primary grades. Very sensitive tests
would be necessary to differentiate gains of two classes whose objectives
overlap.

A third randomization design would be to withhold compensatory
education service from a randomly selected group of students for a year
and invest the money saved in a trust fund for those students. Although
this possibility is bizarre, it should not be dismissed without
consideration. Perhaps the most difficult problem for this design is the fact
that there are substantial side effects of the introduction of Title I
funds into a school that would not be felt if the money were in the bank.

A fourth design for random assignment can be used when there are not
sufficient Title I funds to serve all the needy students. Rather than
dilute the program's effectiveness by giving each student less service,
and rather than assigning funds on some basis such as ability of a teacher
or school administrator to write a good program proposal (which may not
be indicative of the actual service delivered), some of the funds
could be assigned randomly. This procedure is fair and can be agreed to
in advance. Although it would be infeasible to implement at the level
of selecting individual students, it proved feasible in the selection
of schools for ESAA money in the evaluation designed by USOE and carried
out by the System Development Corporation (1976). As that evaluation
showed, however, it is necessary to have advance agreement that no
compensating local resources that might affect the level of performance of
students in the control schools will be allocated to those schools during
the period of the evaluation. In that study, because the Office of
General Counsel held that USOE administrators could not affect the
allocation of other resources to make up for ESAA allocations, the
evaluation was compromised. The general heuristic of substituting a service
or value to be provided after the evaluation is completed appears to
be a reasonable compromise between program operation and program
evaluation.

A fifth possibility occurs in districts with a wide range of economic
status, where it is required that local administrators select the schools
serving the most economically disadvantaged children to receive Title I
assistance and demonstrate that non-Title I schools are not receiving
compensating resources from other sources. In these cases, random
assignment of a few groups of children to Title I and non-Title I schools
would provide the basis for an overall comparison between the Title I
and non-Title I schools, although it would be difficult to make inferences
about the effectiveness of particular methods in such a design.

Finally, if the base of comparison were taken not as between the
Title I treatment of interest and the standard instructional treatment
but rather between a Title I treatment of interest and a standard that
is agreed to be highly effective (although possibly too costly for
widespread use), then random assignment could easily be justified. The aim
of this comparison would be to show whether the treatment of interest
was as good as the "standard of excellence," presumably at less cost.
This provides an argument for the identification of at least one method
of compensatory education, however costly, that can be assumed successful
wherever implemented.

To summarize, the costs of randomization are nearly all in terms of
services withheld, and several rationales exist for compensating for or
justifying the withholding of services. Of course, a thorough
consideration of randomization would have to investigate secondary costs for the
teacher and for other students: for example, the greater classroom
homogeneity of achievement level when low achievers are taught separately
might possibly benefit noncompensatory classes as well as the compensatory
classes (although the results of Trismen et al., 1975, suggest not);
randomization removes that possibility. However, in view of the marginal
costs of the other methods to be described, randomization deserves careful
consideration (as in Conner, 1977) for future evaluations that require
relative comparisons.

There is another cost associated with use of randomized control
groups that applies equally to nonrandom comparison groups but not to
other comparison methods. This is the cost of assuring that the control
group is not affected by the Title I service; if it is affected, this
would bias the comparison. There are numerous sources of subtle effects
of which the evaluator must beware and which he/she must either avoid
or measure and correct for. If students in both groups are in the same
classroom or even the same school, some peer teaching of skills learned
in the compensatory treatment will be very likely to affect other students;
if a group of students is aware that they are being used as the control
group, competitive spirit will lead to greater achievement than were
there no evaluation (the "John Henry effect"); teachers are likely to
discuss with each other methods that have been successful, thus spreading
their use; if both groups are in the same classroom, the teacher may
notice "mistaken" assignments to treatment and control groups and reassign
students to achieve maximum benefit from the compensatory education
program, ignoring the effect of this on evaluation; and districts may
unconsciously favor schools not receiving Title I money with other
opportunities "in order to be fair," although that is precluded by
Title I regulations. Thus, randomization or other methods of selection
of a control group will have costs associated with the proximity between
treatment and control groups that other comparison methods do not.

We turn now to nonrandom comparison groups. The problems of
evaluations involving nonrandom comparison groups have been discussed at
greater length than any other methodological topic in the relevant
literature (e.g., Thorndike, 1942; Campbell & Stanley, 1963;
Campbell & Erlebacher, 1970; Glass, Peckham, & Sanders, 1972;
Kenny, 1975; Porter & Chibucos, 1974; Sherwood, Morris, & Sherwood,
1975; Campbell & Boruch, 1975; Boruch, 1976; Reichardt, 1976).
Although we leave the details of the methods of analysis when one has
nonrandom control groups to the discussion under Issue 8, we shall
consider the problem generally here in order to understand the costs
involved in choosing to use a nonrandom control group for a relative
comparison in evaluation.

The basic problem is that the treatment and comparison groups must
be determined to be equivalent in all relevant aspects, so that they
can be compared "as if" the selection had been random. That equalization,
which is a form of interpretation of observations, depends on assumptions.
Those assumptions are numerous, and testing them is both necessary and
costly. While there have been notable advances in expanding the available
methods for correcting for the nonequivalence of control groups, there
have been equally notable additions, especially by Donald Campbell and
his associates, to the list of problems that must be dealt with in
analyzing data from "quasi-experiments" (as Campbell & Stanley, 1963,
referred to designs without randomized assignment).

In order to understand the scope of the problem, let us consider
the simplest form of correcting for the nonequivalence of control groups,
one that has been berated often, is still often used, is really no worse
than some more sophisticated methods, and one form of which was recently
strongly defended (Sherwood, Morris, & Sherwood, 1975). This method is
matching: for each treatment subject, a control subject is selected to
be as similar as possible to him/her before the treatment, and differences
are measured between the pairs on completion of the treatment. The
following list of problems with this method, taken from Campbell &
Boruch (1975), is incomplete but will show the extent of the problem.
It should be noted that the method proposed by Sherwood, Morris, &
Sherwood (1975) may not be subject to many of these problems, because
theirs was an attempt to match pairs exactly on dozens of dimensions
simultaneously. The problems listed are primarily for the case in
which matching is on a pretest.

1. Differential regression to the mean: Children selected by their
   teachers as needing compensatory instruction are likely to have
   obtained low pretest scores because their true scores are low;
   however, those noncompensatory students whose low observed scores
   match the compensatory students are likely to have obtained the
   low scores through random error. On retesting, their scores would
   be expected to be higher than the matched compensatory students
   because the random error would not be likely to be in the same
   direction. The problem is that matching is on observed scores, not
   on (unobservable) true scores, and the result is that compensatory
   education can look bad entirely due to the statistical artifact.
   A hypothetical example is shown on the next page.

2. Differential growth rates: Children who learn more slowly are
   farther behind their peer group at any age, and conversely,
   children who are farther behind tend to have a slower learning
   rate, at least in most cases. Two students may have achieved
   the same level, however, and have different growth rates, for
   example because the slower student was given extra help. Now,
   an intelligent teacher is likely to be able to discern which
   of two children scoring low on reading has a real learning
   problem requiring compensatory instruction and which is merely
   not performing up to his/her capabilities and can be expected
   to cope with the tasks in the regular class. It would surely
   be unfair to the compensatory treatment to match these two
   children for the purposes of evaluation. See the example below.

3. Test floor effects: Each achievement test is designed for a
   particular range of ability levels. However, if the test used
   for an evaluation is not very carefully chosen, some low
   achievers may in fact have pretest true ability levels
   significantly below the level needed to exceed chance (pure guessing)
   performance scores. For example, in the Compensatory Reading
   Study (Trismen et al., 1975), there were numerous means for
   groups of compensatory reading students that were below the
   guessing level for the test. The pretest scores of students
   who purely guess will be positive, however, because some guesses
   will be correct, and they will be matched by controls who perform
   at chance levels that reflect their true scores. In the course
   of a school year, the treatment and control students might learn
   an equal amount; but that amount might not be enough for the
   treatment students to exceed chance levels. Thus their observed
   gain would be zero, compared to a positive gain in the control
   group. See the example below.
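
The first problem in the list, differential regression to the mean, can be
illustrated with a small simulation. What follows is only a sketch under
stated assumptions, not a procedure taken from the studies cited: the
population means (40 and 60), the error size, the matching band, and the
function name simulate_matching are all hypothetical. Neither group receives
any treatment effect, yet the matched controls score higher on retest.

```python
import random
from statistics import mean


def simulate_matching(seed=0, n=20000, error_sd=10.0, band=(35.0, 45.0)):
    """Return mean posttest scores (compensatory, matched control)
    when neither group receives any treatment effect."""
    rng = random.Random(seed)

    def observe(true_score):
        # Observed score = true score + independent random error.
        return true_score + rng.gauss(0.0, error_sd)

    # Hypothetical true-score populations (assumed values, not real data).
    compensatory = [rng.gauss(40.0, 10.0) for _ in range(n)]
    regular = [rng.gauss(60.0, 10.0) for _ in range(n)]

    def matched_true_scores(true_scores):
        # "Matching" keeps students whose OBSERVED pretest falls in the band.
        low, high = band
        return [t for t in true_scores if low <= observe(t) <= high]

    comp_matched = matched_true_scores(compensatory)
    control_matched = matched_true_scores(regular)

    # Posttest: true scores unchanged (zero effect), fresh random error.
    comp_post = mean(observe(t) for t in comp_matched)
    control_post = mean(observe(t) for t in control_matched)
    return comp_post, control_post


comp_post, control_post = simulate_matching()
print(f"compensatory posttest mean:    {comp_post:.1f}")
print(f"matched control posttest mean: {control_post:.1f}")
```

Because the matched controls were selected for unusually low observed
scores, their retest mean drifts back toward their own population mean, and
a naive comparison would credit the compensatory program with a negative
effect that does not exist.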
In each of these cases, a test could be made for whether the particular
biasing effect actually occurred and a correction made. For example,
parallel forms of the test could be given to each student to estimate the
amount of regression to the mean, and scores could be corrected before
matching. Measures of growth rate could be obtained by administering
several pretests over a period of years preceding the treatment. Test
floor effects can be avoided by pretesting the tests before using them
for the evaluation study or by the development and use of wide-range tests,
such as the sequential branching tests that can be administered under
computer control. Each of these operations adds significantly to the
cost of the evaluation, however, and it is not too cynical to expect that
a sophisticated methodologist will be able to find some new source of
bias after the study is completed. In some cases, it may be expected
that the results will be so clear-cut that statistics are hardly necessary
(for example, if all students in some compensatory program scored in the
bottom half of their class on the pretest and in the top half of their
class on the posttest, no statistical artifacts would be important).
It also may be that the results will be noncontroversial (for example, if
they are merely to corroborate results obtained from different methods
of evaluating the particular program). In these cases, the pressure on
internal validity is not as great, and one might conclude that the cost
in units of credibility may not justify abandoning the alternative of
matching. The history of politicization and controversy of Title I
evaluations, however, suggests caution in sacrificing validity to save
other costs.

Another approach to this problem, which has its own costs, is to
develop airtight methods for interpreting results based on nonrandomized
studies. Porter (1967) and Kenny (1975), for example, have improved
the methodology (to be discussed under Issue 8), and the National Science
Foundation and the National Institute of Education have recently been
supporting some research into better methods, so the possibility of the
development of improved analysis procedures for noncomparable control
groups should not be dismissed. The proper method is not apparent in
1977, however, and there is no guarantee of solution in the near
future. Nevertheless, more intensive effort in this direction seems
warranted, unless either randomization becomes politically acceptable or
evaluations change toward absolute comparisons rather than relative
comparisons.

The third method for relative comparisons is to compare the Title I
participants with the "norm group," that is, with the scores of the
representative national sample of students who took the test before it
was published in order to establish the meaning of the raw scores in terms
of comparison to the population. The model used for such comparisons in
an evaluation is the "equal growth" assumption. This is the assumption
that, under no special treatment, a student who scores at, say, the 20th
percentile relative to others at his grade level (or 7 months below
grade level or 10 items or 1 standard deviation below the mean) at the
beginning of one grade is expected to score at the 20th percentile
relative to his peers (or 7 months below grade level or 10 items or 1
standard deviation below the mean) at the beginning of the next grade.
All of the validity problems of nonrandom control groups apply equally
to this method of comparison, and it also is subject to the numerous
problems that arise from reliance on norms (see Issue 6). Moreover,
Kaskowitz and Norwood (1977) have presented data that indicate that
the equal percentile growth assumption leads to underestimation of
expected gains of students at the lowest percentiles, based on data from
recent evaluations; and the distortions associated with use of
grade-equivalent scores are well-known (see Issue 7).
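
The percentile form of the "equal growth" assumption can be written out as a
short calculation. This is a minimal sketch with assumptions labeled: real
norms are published as tables, whereas here each grade's norm distribution is
approximated by a normal curve, and the function names and the example norm
values are hypothetical.

```python
from statistics import NormalDist


def expected_posttest(pre_score, pre_norm, post_norm):
    """Expected no-treatment posttest score under equal percentile growth:
    the student keeps the same percentile rank in the later norm group."""
    percentile = pre_norm.cdf(pre_score)    # rank in the earlier-grade norms
    return post_norm.inv_cdf(percentile)    # same rank in the later-grade norms


def gain_relative_to_norm(pre_score, post_score, pre_norm, post_norm):
    """Observed posttest minus the posttest expected with no special treatment."""
    return post_score - expected_posttest(pre_score, pre_norm, post_norm)


# Hypothetical norm distributions (assumed values, not from any test manual).
grade3_norm = NormalDist(mu=50.0, sigma=10.0)
grade4_norm = NormalDist(mu=60.0, sigma=12.0)

# A student one standard deviation below the grade 3 mean.
print(expected_posttest(40.0, grade3_norm, grade4_norm))
print(gain_relative_to_norm(40.0, 52.0, grade3_norm, grade4_norm))
```

Any positive value from gain_relative_to_norm would be read as a program
effect under this model, which is why errors in the assumption itself
translate directly into errors in the evaluation.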

In view of the numerous problems associated with use of test norm
data as a comparison standard for compensatory education evaluations, it
is distressing to find that most evaluations carried out to satisfy the
requirements of Title I have been based upon that type of data (see the
discussions of local and state evaluation reports by Wargo et al., 1972;
Gamel, Tallmadge, Wood, and Binkley, 1975; Thomas and Pelavin, 1976).
The use of such data is even recommended as one alternative for future
local Title I evaluations (Wood et al., 1976). Only when special research
studies have been commissioned by the federal government and carried
out by leading research institutes have there been comparisons with control
groups (most notably the Compensatory Reading Study, Trismen et al.,
1975, and the Sustaining Effects Study, System Development Corporation,
1976).

The cost of this method (norm comparisons) in terms of data collection
is minimal, but from the point of view of validity it is substantial.
For the purposes of relative comparison in evaluation, its use should
be corroborative rather than as a sole means of comparison. The costs
of development to establish adequate validity for this method include not
only the costs associated with developing methods for interpreting
comparisons with nonrandomized control groups but also
the costs of refined standardization, in which distributions of scores
in the norm sample are crosstabulated with numerous demographic and
other factors that might be used to match the treatment group to a
subset of the norm sample.

As we have seen, there are substantial problems to be dealt with in
the use of any of these alternative methods for relative comparison. Any
one of them might provide the answer: if a politically feasible method
of randomization were developed, or if sufficient statistical methods
for equating nonequivalent comparison groups were developed, or if
sufficiently reliable and valid test norms were produced. The stakes are
sufficiently important (Title I is spending about $2 billion annually
and is substantially affecting the education of 5 million children
annually) to warrant strong efforts in all three directions. It is our
belief, however, that a fourth alternative, turning to absolute comparisons
in the evaluation of Title I impact, is also viable, and we have taken
the next few pages to discuss that alternative.

Absolute comparisons. Absolute comparisons involve comparison of a
treatment group's performance with an agreed-upon standard, irrespective
of any control group's performance or, really, of any form of expectation
for the treatment group's performance. Four types of absolute comparison
standards, shown schematically in Figure 2, appear to be reasonable for
the evaluation of impact of Title I on children's educational attainment:

1. specified minimum skills to be achieved at each grade level;
2. specified maximum deficits from the population average to be
   allowed at each grade level;
3. specified minimum amounts of skill acquisition per year of
   school; and
4. specified minimum amounts of deficit reduction relative to the
   population per school year.
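
The four standards listed above can be sketched as simple checks on one
student's scores. This is an illustrative sketch only: it assumes pretest
and posttest scores are on a common scale, and the function name, the
parameter names, and every threshold value are hypothetical rather than
proposed standards. Note that no control group appears anywhere in the
calculation.

```python
def check_absolute_standards(pre, post, pop_mean_pre, pop_mean_post,
                             min_skill, max_deficit, min_gain,
                             min_deficit_reduction):
    """Evaluate one student's scores against the four kinds of standard."""
    deficit_pre = pop_mean_pre - pre      # distance below the population before
    deficit_post = pop_mean_post - post   # distance below the population after
    return {
        # 1. posttreatment level against a fixed skill requirement
        "minimum_skill_level": post >= min_skill,
        # 2. posttreatment level against the population average
        "maximum_deficit": deficit_post <= max_deficit,
        # 3. gain against a fixed amount of skill acquisition
        "minimum_gain": (post - pre) >= min_gain,
        # 4. gain against the population (reduction of the deficit)
        "minimum_deficit_reduction":
            (deficit_pre - deficit_post) >= min_deficit_reduction,
    }


# Hypothetical student and thresholds (assumed values).
result = check_absolute_standards(
    pre=30, post=45, pop_mean_pre=50, pop_mean_post=60,
    min_skill=40, max_deficit=10, min_gain=10, min_deficit_reduction=5)
print(result)
```

In this made-up case the student meets both gain standards and the minimum
skill level but still falls short on the maximum-deficit standard, which
shows that the four criteria can disagree for the same student.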

The first two standards are for achievement levels at the conclusion of
a particular Title I treatment, and the last two are for gains. The first
and third are in terms of absolute skill levels, and the second and fourth
are in relation to what skills the population as a whole possesses.
Although all but the first are expressed as the relationship of
posttreatment performance relative to some other performance level
(pretreatment or the general population), they are nevertheless absolute
comparisons in that they can be agreed upon ahead of time, and their
validity in no way depends on the ability to find an equivalent
control group with which to compare the treatment group. For example,
if in the second type of comparison the criterion for concluding that
Title I is having the proper impact is that every participant's
performance be at least at the 25th percentile of the population distribution
upon completion of the treatment,* it is immaterial how the particular
treatment group differed from typical students in the population prior
to treatment.

*This is not paradoxical: it requires that the distribution of skills
be truncated at the 25th percentile, so that the raw score for the 1st
and 25th percentiles would be essentially equal.

As with the alternatives for relative comparisons, the types of
costs for the four methods of absolute comparison vary, and careful
analysis must precede selection of the appropriate method. The only
applications of the methods to Title I evaluations have been in the
searches for exemplary projects (Wargo, Campeau, and Tallmadge, 1971;
Horst & Tallmadge, 1975), and a substantial amount of development
will be necessary prior to their widespread use. Recognition of the
need for such development is apparent from the attempt by Horst and
Tallmadge (1975) to achieve a measure of what they termed "educational
significance" in terms of a comparison of the fourth type. They proposed
that in a search for successful projects one require not only that a
gain be statistically significant, but also that the amount of the gains
be at least 1/3 population standard deviation.* Kaskowitz and Norwood
(1977) have pointed out the need for improving on this arbitrary criterion
before extending its use to other evaluations, implying that it will
be extended whether it is refined or not. Horst (1977) has investigated
the relationship of gains of 1/3 standard deviation in a school year to
typical amounts learned in a year. He found that to gain 1/3 standard
deviation a student who is one standard deviation below the mean in an
early grade must learn twice as much as would otherwise be expected and
a student in an upper grade must learn three or four times what is
normally learned in a year.
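
To make the 1/3 standard deviation criterion concrete, the following sketch
converts a gain expressed in standard deviation units into a change of
percentile rank. It assumes exactly normally distributed scores, and the
function name percentile_after_gain is hypothetical. Under that assumption a
student starting at the 16th percentile ends up between the 25th and 26th
percentiles, in line with the example given in the footnote on the
population standard deviation.

```python
from statistics import NormalDist


def percentile_after_gain(start_percentile, gain_in_sd):
    """Percentile rank reached after gaining `gain_in_sd` population
    standard deviations, assuming normally distributed scores."""
    std_normal = NormalDist()                       # mean 0, SD 1
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + gain_in_sd)


# Gain of 1/3 standard deviation from the 16th percentile.
print(round(percentile_after_gain(16.0, 1.0 / 3.0), 1))
```

The same function shows that a fixed gain in standard deviation units moves
students by different numbers of percentile points depending on where they
start, one reason the units of measurement matter.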

Of the four types of absolute comparison, there is little difference
in data collection cost: the only variation is that pretreatment
performance levels must be obtained for the third and fourth methods in
order to calculate gains at the time of posttreatment testing.

*The population standard deviation is an estimate of how far one expects
particular scores to be from the population mean on the average. For a
normally distributed score, about 68% of the scores are within one
standard deviation on either side of the mean, and about 28% more are
between one and two standard deviations from the mean, as shown in Figure
3. A gain of 1/3 standard deviation for an individual at the 16th
percentile, for example, would move that person to the 26th percentile.

Costs for development and for credibility are interchangeable. With
a minimal effort, experts could be brought together to draw up a tentative
list of skills to be achieved at each grade level, for example, but
selecting a single set of skills and gaining universal acceptance for it
is a mind-boggling task. We will consider here the developmental costs
in some detail for the first method because it is potentially the most
far-reaching, and the costs for the other methods involve primarily subsets
of the cost components for the first method.

First, a method for deriving minimum proficiency levels at each
grade level must be agreed upon. Two alternatives present themselves.
The first involves working backwards from minimum proficiency levels that
are to be obtained by the end of 12 years of schooling. Oregon, California,
Michigan, and a few other states are beginning to implement a policy of
minimum proficiency testing for high school graduation, with each local
school district developing local minimum standards. Spady (1977) has
suggested that this will be a widespread practice in the near future.
There is a significant problem in "working backwards" from exit-level
requirements to requirements for each grade level, in that there are many
alternative paths to the learning of basic skills. While there has been
a great deal of research on the hierarchy of skills involved in reading
(Williams, 1973), and the National Institute of Education has focused
a large research effort on the process of learning to read, there has
been no attempt to translate the results into a set of alternative paths
toward minimum proficiency.

The second approach to establishing minimum levels at each grade is
theoretically less ambitious and more appropriate to the basic assumption
of Title I that compensatory education can bring students back into the
mainstream where they can benefit from regular school instruction. This
approach is to establish the skills necessary for benefiting from each
particular classroom's regular instruction and set the goals of the
previous grade's compensatory instruction to bring as many students as
possible up to that level. This could be facilitated by curriculum
developers' specifications of skills needed for their published materials.
Of course, the cost of generating such specifications (correctly) would
be quite significant, and whether that cost should be reflected in higher
costs for textbooks or treated as a governmental responsibility is not
clear.
Of these two approaches, the former has the advantage that it treats
the schooling process as a single system that produces skills in students
leaving the system sufficient for coping with life's problems. The latter
has the advantage of more easily fitting into the existing education
framework in each school district, but it requires that teachers in
successive grades get together to set up objectives for compensatory
education, and it ignores the needs of children who frequently switch
schools. On the dimension of evaluation credibility as a tool for
program development, the audience must be specified to decide between these
approaches: for the local district, the second approach is more beneficial
in that it facilitates incremental improvements within the existing
framework. At the national level, where the concern is for preparation
of the adult citizenry of the next generation, the results of the first
approach are more meaningful. To arrive at a choice between these methods
clearly requires additional work.

Turning now to the second type of absolute comparison (using a
national average), theoretical problems of defining what skills are
really necessary are replaced by the empirical problem of determining the
percentages of children at each grade level possessing various cognitive
skills and by the systemic problem of determining how close to "equality"
of achievement to aim for. Of course, absolute equality of achievement
is an unattainable and indeed undesirable goal in a free pluralistic
society.

One answer to the question of "how equal" the program should aim 
students is to use data (for example, from the National Assessment of 
Educational Progress) to estimate the percentage of young adults nation- 
wide who do not possess minimum proficiency levels agreed upon by experts 
and, after correcting for various statistical artifacts such as varying 
rates of early dropping out of school, set that percentage as the goal 
for compensatory education. For eccample, if it is determined that 15% 
of the young adult population^ is mathematically iilctmpetent upon high 
school graduation, this means that 85% are judged at least minimally 
competent (i.e., not requiring federal inter^/entlon) . Roughly, this 
implies that if the performance of all students at each grade level is 

'■13 . ' ' 



42 



maintained at levels within what is the top 85X (above the 15th percentile) 
of the existing population, all students graduating from the system in 
thtt future will be mathematically ccJmpetent (by standards of the 1960s ^ 
and 19708)*.' ! 

Contrasting the first two*^ypes of absolute comparison, one arrives 
at the conclusion that the decision between these two methods for 
•valuation depends on the answers to crucial systemic questions about 
th* role of Title I in society: is it to ensure a certain minimum skill 
level or to ensure a certain approximation to equality of achievement? 

The third and fourth types of absolute comparison both differ from 
the first two types only by taking into account the students' levels of 
performance at the beginning of participation in a Title I program. We 
can, therefore, discuSs them as one. The primary advantage of expressing 
ioals in terms of gains rather than absolute Xevels is that failure to 
meet the criteria can more easily be attributed to deficiencies in the 
program: of two programs evaluated completely in terms of posttests, 
one might appear more successful m9rely because its students were further 
advanced at the beginning of the program. It would be wrong to select 
programs on that basis. The main drawback of using gains as the criterion 
is that they do not relate directly to practical criteria, such as 
possession of particular skills after 12 years of school, possession of 
skills necessary for regular instruction in the next grade, or perfor- 
mance at a specified level relative to the population. If a student is 
sufficiently far behind upon entry, then even epctraordinary gains may 
leave hlm/het below desired posttreatment levels. One way in which 
criteria could encompass both concepts (uslilg gains and relating to 
absolute posttreatment levels) is to specify that gairis should be enough 
to close the gap between pretreatment levels and desired levels by a 



*The reader should not fall into the trap of worrying that there will 
always be a bottom 15%. Of course there will; however, the goal would 
be for their skills to be above what is now the 15th percentile. As
society changes in the 1980s and 1990s that criterion could be
expected to change.

significant fraction, such as halfway, if the student has more than a
particular specified deficit on entry. A more thorough solution would be
to perform both types of absolute comparison (i.e., with and without
correction for pretreatment performance levels). The implications for
policy are shown in Table 3; they are analogous to the differential
implications from absolute and relative comparisons shown in Table 2.
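
The combined criterion described above (close a fixed fraction of the gap
when the entry deficit is large) can be illustrated with a minimal Python
sketch. The function names, the halfway fraction as a default, the deficit
threshold of 10 score points, and the rule that students with smaller
deficits must close the whole gap are all assumptions made for illustration;
the report does not specify them.

```python
def required_gain(pretest, desired_level, deficit_threshold=10.0, fraction=0.5):
    """Return the minimum gain a student must show to meet the criterion.

    Hypothetical rule: a student whose entry deficit exceeds
    deficit_threshold must close `fraction` of the gap; any other
    student must close the whole gap (i.e., reach desired_level).
    """
    deficit = max(desired_level - pretest, 0.0)
    if deficit > deficit_threshold:
        return fraction * deficit
    return deficit


def meets_criterion(pretest, posttest, desired_level, **rule):
    """True if the observed gain is at least the required gain."""
    return (posttest - pretest) >= required_gain(pretest, desired_level, **rule)
```

Under these assumed values, a student entering 20 points below the desired
level would be judged on whether he/she gained at least 10 points, whereas
a student entering 5 points below would be expected to reach the level.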

Summary. The problem of what to compare Title I treatments to is
complex and involves careful analysis of the program's basic assumptions.
We have considered seven classes of alternatives: relative comparisons
using randomized assignment, nonrandom comparison groups, and national
norms as standards, and absolute comparisons involving either posttreatment
levels or gains and either prima facie skill requirement specification or
specification in terms of the skill level in the society. We have not
considered any number of other dimensions that should be in a more
thorough treatment of this subject: Are comparisons with, say, Great
Britain relevant? Are comparisons with the society's costs of dealing
with functionally illiterate adults relevant? Are comparisons with state
compensatory education programs relevant? One can certainly imagine
rationales for important decisions that would depend, at least partially,
on the answers to these questions.

Although it is impossible to rule out as inappropriate any of the
seven categories of comparison discussed, it appears that much of the
lack of direct impact of evaluations on program operation (Cohen & Garet,
1975) may be due to complete focus on relative comparisons, which merely
test whether a program is better than what was being done previously,
rather than absolute comparisons of whether the program is achieving
specific educational goals. The change of focus toward absolute evaluations
is needed and shows signs of occurring in the near future.

Table 3

Implications of Annual Comparisons of Gains or Posttreatment Levels

                              Comparison of Posttreatment Levels

                      Positive Results           Negative Results

Comparison of Gains

  Positive Results    Conclusion:                Conclusion:
                      Program operation          Greater program effort
                      satisfactory.              needed; or gain criterion
                                                 needs revision; or
                                                 posttreatment comparisons
                                                 should await another
                                                 year's gains.

  Negative Results    Conclusion:                Conclusion:
                      Program effort not         Program needs redirection,
                      dealing with a clear       new methods.
                      need.

Issue 2. Is longitudinal evaluation necessary?



A longitudinal design is one which requires the collection of data
from the same source two or more times over a period of time. In order
to determine whether a longitudinal design is necessary for Title I
evaluation, it is necessary to examine the information needs guiding the
evaluation to determine whether they warrant the expenditure of effort
required for longitudinal data collection. The first question is whether
temporal relational information is necessary. If it is, the next
question is whether less problematic designs, retrospective data collection
or cross-sectional designs, would provide sufficiently valid information.
The final question is, if temporal relational information is required,
over how long a period of time must the data be collected.

The answer to the first question is that for evaluation of impact on
children's school achievement, although not so clearly for gathering
information on compensatory education processes, temporal relational
information is likely to be necessary. As long as the impact is measured
in terms of gains, starting from disadvantage, there must be some version
of a "before" and "after" measure. If the framework of comparison were
to be oriented to the comparison of posttreatment levels with a standard,
irrespective of pretreatment differences, temporal relational information
would not be so important; however, that would require a substantial break
from the current evaluation tradition.

The temporal relational information normally required has three
components: (1) a measure of a child's achievement level prior to the
Title I treatment, (2) a measure of the child's participation in the
treatment, and (3) a measure of the child's achievement level following
the treatment. The second question we posed for deciding on longitudinal
designs was whether the information could be gathered by easier methods.
The easiest would be retrospective data collection, the use of a respondent's
memory to construct temporal relational information; however, that is
not feasible for the assessment of pretreatment achievement levels. As
survey methodologists have frequently pointed out, the reconstruction of
previous subjective variables has little validity, and retrospective
questions should be limited to questions concerning actual, objective
events, such as switching of schools, participation in special classes,
and so on. This lack of validity of retrospective subjective reports
applies also to the reports of teachers that (although the test scores
may not show it) the children in their classes improved "significantly"
during participation in a program (e.g., Stearns, 1977). Although
teachers may be quite sincere in these reports, there is a great likelihood
that they may be based on the teachers' unconscious selective perception
of behaviors that matched their expectations or desires.

The second alternative is a cross-sectional design. Rather than
collecting pretreatment and posttreatment scores on the same students,
it might suffice to collect them on different students. The reasons for
doing this might be (a) to circumvent the methodological problem with
pretest-treatment-posttest designs that the pretest may itself affect
the way in which the treatment is perceived and assimilated by students,
or (b) to shorten the time needed to study a long-term treatment (e.g.,
one could estimate 4-year gains by measuring 2nd and 6th graders in a
school at the same time).

The primary requirement for inferring temporal relational information
from cross-sectional designs is that the samples on which the
different measurements are made be equivalent in all relevant respects.
For cross-sectional designs aimed at the first of the two problems (effects
from the pretest), this can be accomplished by randomly assigning
students to either a pretest-treatment or a treatment-posttest condition
or, better yet, by randomly pretesting only half of the students,
posttesting all of them, and testing for the existence of a pretest-treatment
interaction. There has been little, if any, use of such a design in
Title I evaluations to avoid pretest-treatment interactions, probably
because of other advantages, to be listed below, of true longitudinal
designs. One particular problem that has been rarely recognized is that,
in comparing a treatment's gains with a national test norm, the students
are normally taking the test (a parallel form) for the second time at
the posttest whereas the norms were developed on students taking it for
the first time. This is one of the many problems in using test norms
for program evaluation to be discussed under Issue 6.

The more practical reason for using cross-sectional designs is to
shorten the data collection period. While there is no problem in
commissioning evaluations that measure pretreatment achievement levels in
the fall and posttreatment achievement levels in the following spring, it
is much more costly, in many ways, to measure pre-to-post gains over a
period of several years to test rationales based on long-term effects of
compensatory education.

The issue of how long after a Title I treatment one should
measure the effects of that treatment has proven to be important,
because of reported results (e.g., Thomas and Pelavin, 1976;
Pelavin and David, 1977) that students show good progress when measured
from a pretest in the fall to the posttest in the spring of the same
year, but looking over the longer trend, the students who are Title I
participants tend to fall further and further behind with each grade. For
this reason, the question of whether there are long-term, sustained effects
of Title I is not answered by evaluations of short-term gains. The
evaluation of these effects is the goal of the current evaluation of the
sustaining effects of compensatory education being carried out for USOE by
System Development Corporation.

The question of whether Title I should be evaluated in terms of
achievement gains within the school year or over a longer period depends
on fundamental systemic questions about the aims of Title I. These aims
are not to provide a separate school track for the educationally
disadvantaged, in which each grade teaches one set of materials to compensatory
students and another to noncompensatory students, but to teach the skills
necessary to bring children up to the level of competence necessary to
benefit from noncompensatory instruction. The consequences of this
view of the purpose of Title I are substantial. For example, it leads
to the allocation of funds to the early grades, to ensure that children
who start out with a home life that does not provide them with the
prerequisites for handling schoolwork successfully will be brought up to
a level at which they can cope with their school tasks as soon as possible.
The alternative view is that the students who are in Title I will need a
continuing special education program because of their lower capacities
for learning (for whatever reason), in which case the expectation is
that at the end of the first year of a treatment they will have learned
more than if they had not had that treatment, but that nevertheless
they will need to continue the treatment in succeeding years. In this
view, the role of the Title I treatment is not to give them some basic
skill that allows them to catch up, but rather to provide a different, more
individually adaptive and possibly more expensive curriculum by which they
can learn what is needed to be learned at each grade.

At the very least, there must be measurement from one year to the
next, because of the various problems stemming from the administering of
pretests in the fall and posttests in the spring. There are apparently
differential losses of skills over the summer, and it is quite possible
that students who learn well during the school year in the Title I
program may in fact lose a lot of what they have learned over the summer
and come into the next grade further behind their peers than they were
at the beginning of the previous grade.

Evidence to support the need for valid, long-term, temporal relational
information has come from studies of the long-term effects of Follow
Through on participants in New Haven, Connecticut (Abelson, Zigler, and
DeBlasi, 1974; Seitz, Apfel, and Efron, 1977). In that set of studies, which
involved testing children several times starting in kindergarten and
continuing through the eighth grade (as of 1977), continued differences
between Follow Through and non-Follow Through participants four years after
completing the program were shown. Although the samples were quite small
and the results not truly dramatic in that study, the use of a longitudinal
design did result in demonstration of long-term gains that have not been
found in cross-sectional comparisons.

The decision between cross-sectional and true longitudinal designs
is complex. Although we can list various advantages and disadvantages
of each, the choice in any particular situation will depend on the
values assigned to the various advantages at that time.

The primary problem for cross-sectional comparison designs is
establishing the equivalence of different cohorts. Among the factors that
operate to produce nonequivalence are the following:

1. mobility of students: a significant number of students change
schools during the primary grades, and that movement is not random;
thus, the sixth graders in a school are not likely to be exactly
the same as what the second graders in that school will become in
four years;

2. curriculum variations: the curriculum in a school and the teachers
employed to teach the children of the two cohorts will vary; and

3. population trends: a general population trend in achievement
scores will confound results.

Another type of problem for cross-sectional designs is that they cannot
make use of relations in individual data, but must rely on group means. This
means, for one thing, that no information relating variation in personal
traits and experiences to variation in long-term gains can be extracted
without a true longitudinal design. It also reduces the reliability of the
results: the random variation of gains across individuals is substantially
smaller than the variation of differences between randomly paired pretest
and posttest scores that could be constructed from a cross-sectional
comparison.*
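
The reliability argument above can be checked numerically. The following
Python sketch is only an illustration under stated assumptions: scores are
simulated as normally distributed with equal pretest and posttest variance,
the function names are hypothetical, and the default values (a standard
deviation of 10 and a pretest-posttest correlation of .5) are chosen for
convenience rather than taken from any Title I data.

```python
import math
import random
import statistics


def gain_variance(sigma, r):
    """Analytic variance of posttest - pretest: 2 * sigma^2 * (1 - r)."""
    return 2.0 * sigma ** 2 * (1.0 - r)


def simulate(n=20000, sigma=10.0, r=0.5, seed=0):
    """Return (longitudinal_var, cross_sectional_var) from simulated scores."""
    rng = random.Random(seed)
    pre = [rng.gauss(0.0, sigma) for _ in range(n)]
    # Posttest has the same variance as the pretest and correlation r with it.
    post = [r * x + math.sqrt(1.0 - r * r) * rng.gauss(0.0, sigma) for x in pre]
    # Longitudinal design: each pretest is paired with the same student's posttest.
    paired = [b - a for a, b in zip(pre, post)]
    # Cross-sectional analogue: posttests are paired with unrelated students.
    shuffled = post[:]
    rng.shuffle(shuffled)
    unpaired = [b - a for a, b in zip(pre, shuffled)]
    return statistics.variance(paired), statistics.variance(unpaired)
```

With the assumed defaults, the analytic values are 100 for the longitudinal
gain and 200 for randomly paired differences; the simulated variances
returned by `simulate()` should fall close to those figures.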

A discussion of the use of longitudinal evaluation in educational
evaluation has been provided by Ryan (1974). One comparison of results obtained
from longitudinal and cross-sectional evaluations of the same educational
program (Dyer, Linn, & Patton, 1969) found significant biases in the
cross-sectional evaluation methods.

*The variance of the (longitudinal) gain measure, assuming equal variances
($\sigma^2$) on the pretest and posttest, is $2\sigma^2(1 - r)$, where $r$, the
correlation between an individual's pretest and posttest score, is likely to
be at least .5. The variance of the differences between pretest and posttest
scores in a cross-sectional design is $2\sigma^2$ (i.e., $r = 0$) and therefore
is substantially higher.

Longitudinal designs take a long time to carry out, however. A compromise
option of overlapping panels of longitudinal cohorts is, on the other
hand, possibly an acceptable alternative to straight longitudinal designs.
As a hypothetical example, over a three-year period three cohorts might be
followed: those who in the first year were in grades 2, 3, and 4. In the
third year, they would be in grades 4, 5, and 6. Whether this design turns
out in fact to be adequate for the particular information need is an
empirical question. If on the overlapping periods of such a design there are
similar relations among different cohorts (e.g., grades 3 and 4 for the
original second and third grades), then it is an adequate design; however,
if during the overlapping period there are different relationships, it will
be difficult to extrapolate from these overlapping periods to provide
temporal relational information between grades 2 and 6. It may, in fact, take
up to ten years to perform the correct, valid evaluation of Title I. Keeping
this in mind, any contracts for collection of data should be carried
out with the assumption that they might be the initial phase of some
longitudinal evaluation that would be completed by some other contract in later
years. Thus, for example, identities of particular students should be
recorded, although carefully guarded from unintended uses, and periodic
efforts to follow the movement of students among schools should be
undertaken. This would allow the evaluation of a program at the later years to
be done in a reasonable time frame for practical policy-making.
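
The overlapping-panel design in the hypothetical example can be laid out
explicitly. The short Python sketch below, with hypothetical function names,
lists the grades each cohort passes through and identifies the
grade-to-grade transitions observed in more than one cohort, which are the
overlapping periods on which the adequacy check would rest.

```python
def cohort_grades(start_grades=(2, 3, 4), years=3):
    """Map each cohort's starting grade to the grades it passes through."""
    return {g: [g + y for y in range(years)] for g in start_grades}


def overlapping_transitions(schedule):
    """Return grade-to-grade transitions observed in more than one cohort.

    Keys are (from_grade, to_grade); values list the starting grades of
    the cohorts in which that transition is observed.
    """
    seen = {}
    for start, grades in schedule.items():
        for a, b in zip(grades, grades[1:]):
            seen.setdefault((a, b), []).append(start)
    return {t: cohorts for t, cohorts in seen.items() if len(cohorts) > 1}
```

For the three cohorts of the example, the transitions from grade 3 to 4 and
from grade 4 to 5 are each observed in two cohorts; the transitions from
grade 2 to 3 and from grade 5 to 6 are each observed only once.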

In addition to the problem that they take too long for many
decision-making purposes, longitudinal studies also incur the costs of correcting
for attrition of various types of participants in the evaluation. First,
students may not be available for all testing sessions, and omitting them
may seriously affect the findings. Trismen et al. (1975) found, for example,
that even within a single school year approximately 10% of the students had
either pretest-only data or posttest-only data. The students who had missed
one or the other test were not a random sample, for they tended to score
lower than the students producing complete data. A method for dealing with
this attrition, nonrespondent sampling, will be discussed under Issue 3.

A second type of attrition is among teachers, administrators, and even
projects being evaluated. If an evaluation measures performance of a set of
treatments over several years, it must "correct for" the fact that the
treatment will inevitably change over years. A third type of attrition is of
evaluation project staff. To ensure that the project will not be subject
to breakdowns if individuals change jobs and are replaced, careful records
of events and procedures (such as telephone conversations) must be kept
that would not be necessary for a project of short duration in which staff
attrition would be unlikely.

Summary. Although it is impossible to give a single answer to the
topical question of this issue ("Is longitudinal evaluation necessary?"),
it is possible to make several recommendations based on the experiences of
previous educational evaluations.

1. Individual achievement gains should be measured for intervals of
whole years to avoid distorting effects of time-of-year (e.g.,
differential amounts of experience with the teacher giving the
test).

2. Conclusions about pretest-posttest gains should not be based on
comparison with published norms, because the latter were obtained
on children who took the test only once.

3. Teachers' retrospective judgments of children's gains should be
ignored. That does not mean that teachers' observations recorded
throughout an evaluation period need be ignored.

4. Longitudinal studies of long duration, making use of overlapping
cohorts where possible, are necessary for the ultimate impact
evaluation of Title I. Such studies are relatively quite expensive,
but whenever the information they provide is needed in valid form,
avoiding them is short-sighted.

5. Any evaluations undertaken without funding for long-term
longitudinal data collection should nevertheless take fairly inexpensive
steps to ensure that the data base acquired can later be used as
the first stage in a longitudinal study.

Sampling


Introduction



Gathering information to test decision rationales is costly, and program
managers and evaluators should weigh the cost-effectiveness of different
information gathering plans much as they would weigh the cost-effectiveness of
different program strategies. A crucial step that determines the cost and
effectiveness of evaluation is sampling. Sampling refers to the process of
selecting a few units from which to gather information (e.g., schools,
classrooms, or children) from a large population. There are many variations of
sampling, and the choice among them must be cognizant of both the cost
components of data collection and the nature of the information needs to be satisfied
in order to provide maximally effective use of evaluation resources.

The need for sampling in the evaluation of Title I is apparent when one
realizes that information is needed on school districts, schools, and school
children in order to formulate policy alternatives. There are over 17,000 school
districts in the country, approximately 90,000 schools, and over 40,000,000
school children, of whom over 40% attend schools receiving Title I assistance.

There are two categories of sampling: formal and informal. Formal sampling
refers to the process of defining a population (e.g., all second graders in
Title I assisted schools) and then prescribing a "sampling rule" that determines
which units in the population will be observed. That rule normally contains a
"random" process, but may be "systematic". Informal sampling refers to the
selection of units to be observed without clear specification of the population
and the sampling rule. The advantage of formal sampling is that it provides a
basis for evaluating how precisely the information gathered on a sample reflects a
population. Although it is customary for policy decisions to be made on the basis
of information from informal samples, any support for a rationale based on an
informal sample is subject to the criticism that the information gatherer may
have deliberately selected units to prove his/her point; such an argument is much
weaker when a formal sampling procedure has been specified. An informal sample
is sufficient only when generalization to a population is unnecessary; for
example, a search for effective projects may appropriately be informal if the
objective is merely to find a few, but must be formal if a conclusion is desired
concerning the frequency of effective projects in a population.

A formal sampling procedure will yield a probability sample that is
representative of a population if the relative frequency (probability) of each unit
being selected is known and greater than zero. Among probability sampling methods,
there are numerous variations that aim to use information known about the
population in order to reduce the cost of obtaining information. The basic
method is simple random sampling with replacement. In order to select such a
sample, one needs a numbered list of the units in the population and a way of
generating a list of (pseudo-) random numbers (e.g., a table in a statistical
textbook). Each successive unit is selected for observation if its number
appears on the list of random numbers. The statistical computations are
simplest for this method of sampling. The first variant is sampling without
replacement, in which if a particular random number occurs more than once on
the list, the corresponding unit is nevertheless only selected once. Because
collecting repeated information on the same unit causes interpretive difficulties,
this variant is nearly universally used, although in practice if the sample is
less than 5% of the population, the two methods should produce essentially the
same conclusions.
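
The selection procedure just described can be sketched in Python. The
function names are hypothetical, and Python's standard `random` module is
assumed here as a stand-in for a printed table of random numbers.

```python
import random


def draw_numbers(population_size, n, rng):
    """Draw n random unit numbers (0-based); repeats are possible."""
    return [rng.randrange(population_size) for _ in range(n)]


def with_replacement(units, n, seed=0):
    """Simple random sampling with replacement: a unit may appear twice."""
    rng = random.Random(seed)
    return [units[i] for i in draw_numbers(len(units), n, rng)]


def without_replacement(units, n, seed=0):
    """Same random-number list, but a repeated number selects its unit once."""
    rng = random.Random(seed)
    # dict.fromkeys drops repeated numbers while keeping the draw order.
    distinct = dict.fromkeys(draw_numbers(len(units), n, rng))
    return [units[i] for i in distinct]
```

Note that the second function, like the procedure in the text, can return
fewer than n units when a number repeats. Where exactly n distinct units
are wanted, `random.sample(units, n)` from the standard library draws them
directly.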

There are four more substantive categories of variation in probability
sampling: stratification, clustering, multistaging, and proportional sampling.
We shall only describe them briefly here; the reader who wishes further
information can consult a textbook on sampling (e.g., Hansen, Hurwitz, and Madow, 1953;
Cochran, 1963; Raj, 1968). Stratification refers to the use of knowledge about
some factor on which the units in the population vary (e.g., region of the country)
in order to ensure that exactly the right number of units is selected from each
"stratum," or level of the factor. Stratification can serve two purposes: (1)
to increase the precision of information gathered by eliminating a portion of
the random error, and (2) to allow more frequent sampling from some strata than
others in such a way that mathematical corrections maintain the
representativeness of the sample. Clustering refers to the sampling of some superordinate
units in order to select units to observe. For example, all the major evaluative
studies of Title I that have reached conclusions concerning children participating
in compensatory education have first selected school districts (USOE, 1970;
Glass, 1970; NCES, 1975, 1976; GAO, 1975) or schools (Trismen et al., 1975), and
then observed only the children in those selected clusters. If within selected
clusters, only a sample of the units of interest is to be observed, then the
sampling procedure is called multistage. The purpose of clustering and
multistage sampling is to reduce the cost of collecting data; for example, test
administration costs are more closely related to the number of testing sessions
required than to the number of children tested in each session, and children in
a single classroom can all be tested in a single session. The fourth major
variation in probability sampling is cluster sampling with probability
proportional to "size." In this variation, the probability of a particular
superordinate unit's being selected is proportional to the number of units of
interest it contains. For example, selection of school districts might be
undertaken based on the average daily membership of the districts, so that a
district serving 20,000 students would have 50 times the probability of being
selected as a district serving 400 students. The purpose of sampling with
probability proportional to "size" is to maximize the precision of information
on the population of interest (e.g., students) while minimizing the number of
clusters that must be contacted in collecting the data.
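
Sampling with probability proportional to "size" can be illustrated with a
minimal Python sketch. The function names and the two-district example are
hypothetical; the sketch covers only a single draw and is not a full
multi-draw selection scheme.

```python
import random


def selection_probabilities(sizes):
    """Single-draw selection probability of each cluster, proportional to size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


def draw_cluster(sizes, rng):
    """Draw one cluster with probability proportional to its size."""
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Applied to a district of 20,000 students and one of 400 students, the
ratio of the two selection probabilities is 50, as in the example in the
text.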

All of these variants improve the efficiency of information gathering over
the basic method of simple random sampling. The costs associated with them are
(1) that they require some further information about the structure of the
population to be sampled; and (2) that the interpretation of the data from more
complex combinations of the variants is more complex, in some cases beyond the
limits of current statistical sophistication.

The first issue to be discussed in this section concerns (1) the relation
of information needs to the need for a probability sample and (2) the threats to
representativeness that must be dealt with.

The second of the two issues discussed in this section concerns the
relationship of cost of data collection to sample size and the relationship of
sample size to the precision of the information produced.

Issue 3. When is representative sampling important?

The need to generalize results from a sample to a population depends on
the decision rationale being tested. There are at least three distinctly
different types of information need that require different levels of
representativeness: (a) the need to know the average value or frequency of an event
in a population (e.g., the average class size of Title I assisted classrooms);
(b) the need to know whether two or more variables are related to each other
(possibly causally) (e.g., an instructional method and amount of student progress);
and (c) the need for some examples of a type of event (e.g., a successful
project). In discussing this issue, we shall consider both the levels of
representativeness needed for each type of information and the two major
threats to representativeness that must be dealt with: misinterpretation based
on confusion of units of analysis and misinterpretation based on lack of usable
data provided by some of the selected units (i.e., nonresponse bias).

For the first type of information need, estimates of parameters of
program operation, strict quantitative representativeness is a necessity.
For that reason, the results of the TEMPO study (Mosbaek, 1968), the
aggregations of annual state reports on Title I (Wargo et al., 1972; Gamel et al.,
1975; Thomas and Pelavin, 1976), and the GAO study (1975) cannot be accepted
as quantitatively accurate pictures of national program operation. The USOE
surveys (USOE, 1970; Glass, 1970) and the NCES surveys (1975, 1976), on the other
hand, do provide quantitatively accurate generalizations to the national
population, insofar as the information gathered from the samples was accurate.

Turning to the second type of information need, whenever the conclusions
to be reached concern the existence of relations that should appear within any
given project, such as between methods and impact, it is not essential that
the project(s) observed be quantitatively representative of a population.
The conclusions would be questioned, however, if the projects selected were
especially unusual on some dimension; therefore, some effort is worthwhile to
select a project or projects that are reasonably representative of a population
to which one wishes to generalize. Obvious examples are experimental
demonstrations that are selected for the particular processes to be investigated;
the implied goal of such studies is to determine better methods for
compensatory education that can be used by the school system at large. As part of the
Compensatory Reading Study, a sample of schools that were either especially
effective or especially ineffective and that varied across clusters of methods
used was selected for in-depth observation. From that investigation, the
researchers were able to identify attributes characteristic of effective
schools. While the results cannot be guaranteed to generalize to all school
districts, they serve a useful purpose in the incremental increase of our
general understanding of how to design compensatory education projects. Other
research studies, such as M. McLaughlin's (1971) and those cited by Gordon and
Kourtrelakos (1971), provide quite interesting recommendations for improving
compensatory education, and although there are grounds for questioning the
validity of their results from a design perspective, the lack of a
representative national sample is not one of these grounds.

As an example of a hypothetical case in which achieving quantitative
representativeness could actually distort the results of a relational study,
consider the data in Table 4. If two factors, A and B, are correlated in the

Table 4

Hypothetical Example of a Distortion
Produced by Quantitative Representativeness

                            Factor A
                     Low             High         Total
                   X      N        X      N         X

Factor B   Low    10    160       20     40        12
           High   10     40       20    160        18



population, then reflecting that correlation in the sample, as shown by the
columns labeled "N" in the table, could result in a spurious conclusion, in
this case that Factor B was a predictor of scores (X). Examination of Table 4
shows that Factor B is not truly directly predictive of scores; only through
its association with Factor A is it correlated with scores. Although data
collected according to representative sampling rules can be treated
statistically to produce undistorted results concerning relations, that treatment can
be quite complex. Data collected according to nonrepresentative but orthogonal
(uncorrelated) sampling rules are easier to interpret.
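
The arithmetic behind Table 4 can be reproduced with a short Python sketch.
The cell means and representative cell counts are those of the table; the
orthogonal design with 100 units in every cell, and all names in the code,
are assumptions introduced for illustration.

```python
# Cell means (X) and representative cell counts (N) from Table 4,
# keyed by (factor_b_level, factor_a_level).
MEANS = {("low", "low"): 10, ("low", "high"): 20,
         ("high", "low"): 10, ("high", "high"): 20}
REPRESENTATIVE_N = {("low", "low"): 160, ("low", "high"): 40,
                    ("high", "low"): 40, ("high", "high"): 160}
# Hypothetical orthogonal design: equal counts in every cell (assumed 100).
ORTHOGONAL_N = {cell: 100 for cell in MEANS}


def mean_by_factor_b(counts):
    """Weighted mean score at each level of Factor B, pooling over Factor A."""
    result = {}
    for b in ("low", "high"):
        cells = [cell for cell in MEANS if cell[0] == b]
        total_n = sum(counts[cell] for cell in cells)
        result[b] = sum(MEANS[cell] * counts[cell] for cell in cells) / total_n
    return result
```

With the representative counts the Factor B means are 12 and 18, matching
the Total column of Table 4, although within each level of Factor A the
scores do not differ by Factor B. With equal cell counts both Factor B
means are 15, and the spurious difference disappears.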

The third type of study, popular in the federal educational administration
because of its potential for producing large benefits, is the search for
successful, exemplary projects that can be packaged and disseminated.
Representative sampling is not needed to satisfy this type of information need.
It is much more efficient to use any informal sampling methods available,
such as consulting program experts, in order to focus observation on the
successful projects. This type of study has had a recurrent problem, however,
that may be due either to problems with the method of identifying outstanding
projects or the problem of capitalizing on chance occurrences: later
evaluations have in many cases not clearly corroborated the success of the
projects identified earlier as exemplary (Wargo et al., 1971; Stearns, 1977).


To summarize the needs for representativeness, the method of selecting a
sample for an evaluation study is dependent upon the objectives. Studies
aiming to identify relations among processes and outcomes should avoid random,
representative sampling in favor of sampling for significant variation in
processes and outcomes. Studies aiming to assess parameters of program
operation statewide or nationwide, on the other hand, must obtain
representative samples in order to provide accurate, unbiased information. For example,
we would not require a study that found individualized instruction to produce
reading gains to have a nationally representative sample, but we would require
representativeness of a study that reported that blacks tended to receive
compensatory instruction relatively more frequently than whites. In general,
this issue is not as controversial as some others, primarily because the
methodological problems have apparently been at least approximately solved.

Turning now to the threats to representativeness, the first threat
(misinterpretation based on confusion of units of analysis) is a semantic
problem that merely requires sophistication on the part of the evaluator to
avoid erroneous statements of conclusions. The second threat
(misinterpretation based on nonresponse bias) is a substantive problem
requiring careful attention in the planning and execution of data collection
as well as careful interpretation of data.

The problem of confusion of units of analysis arises when one uses
clustering or multistage sampling. The simplest way of avoiding confusion is
to state results in terms of an "observational" unit that is equivalent to the
clustering unit. Observational units are units that are referred to in
statements summarizing the results of the evaluation. Thus, a statement in the
conclusion of an evaluation report might be either "compensatory education
projects in the sample tended to vary greatly in ..." or "compensatory
education students in the sample tended to vary greatly in ..." Each
statement presumes a particular type of observational unit. Sampling units, as
opposed to observational units, are the units whose relationship to a
population of interest is known. It is important to establish the observational
unit that is crucial for the information needed and then to select sampling
units so that statements can be validly made in terms of those observational
units. The possible observational and sampling units for Title I include:

1. students,

2. teachers,

3. groups of students receiving a particular service,

4. classrooms,

5. schools,

6. school districts, and

7. states.

Is it reasonable to sample schools within a state and make statements
about students? The answer is generally "yes." However, when the schools
do not exactly represent the proportions of students in the population for
which generalizations are to be made, then the mean scores for the schools
must be weighted differentially during aggregation.

Basically, if the observational unit is to be students, then to produce
stable, unbiased estimates for the population of students based on a sample
of schools (and testing of a specified set of students in each school), it is
most efficient to select schools in such a way that the likelihood of each
school being selected is proportional to the number of students in the school.
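
The two ideas above, weighting school means by enrollment and selecting
schools with probability proportional to size, can be sketched as follows.
The three schools, their enrollments, and their mean scores are hypothetical,
and the draw shown is with replacement (a simplification; practical designs
usually sample without replacement).

```python
import random

# Hypothetical schools: (name, number of students, mean score).
schools = [
    ("A", 100, 40.0),
    ("B", 300, 50.0),
    ("C", 600, 60.0),
]


def unweighted_mean(schools):
    """Average of school means; each school counts equally."""
    return sum(mean for _, _, mean in schools) / len(schools)


def student_level_mean(schools):
    """Average of school means weighted by enrollment; each student counts equally."""
    total_students = sum(n for _, n, _ in schools)
    return sum(n * mean for _, n, mean in schools) / total_students


def draw_proportional_to_size(schools, k, seed=0):
    """Draw k schools with replacement, with probability proportional to enrollment."""
    rng = random.Random(seed)
    sizes = [n for _, n, _ in schools]
    return rng.choices(schools, weights=sizes, k=k)


print(unweighted_mean(schools))     # 50.0 (a statement about schools)
print(student_level_mean(schools))  # 55.0 (a statement about students)
print([name for name, _, _ in draw_proportional_to_size(schools, k=2)])
```

The gap between 50 and 55 is the distortion that arises when school means are
aggregated without regard to how many students each school represents.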

A problem that can arise if one is not careful in using differing
observational and sampling units (e.g., students and schools) is in making
observational statements that in fact depend on the way in which observational
units are distributed within sampling units. Such an error occurred in the
Compensatory Reading Study (Trismen et al., 1975). The authors noted (p. 75)
that minority disadvantaged students tended to receive compensatory
instruction in separate classrooms, while white disadvantaged students tended
to receive it in classrooms combined with non-disadvantaged students. But
their conclusion, "it seems that such student assignments are being made at
least in part on the basis of ethnicity," overlooks the structure of their
sampling. In fact, it is equally plausible that these effects were between
schools and that schools with especially large minority enrollments were also
those that, for other reasons, had chosen to use separate rather than combined
classes for compensatory reading instruction. This possibility would have been
easily testable had the analyses taken into account the difference between the
sampling method (by schools) and the units about which the statement was
intended to be made (students).

To summarize, clustering or multistage sampling requires some care in 
interpretation of data that is not necessary in studies employing simple 
stratified random sampling. Otherwise, conclusions can be reached and
rationales supported that are in error. 

The other threat to representativeness is nonresponse. This important
aspect of sampling, which occurs in practice but is not usually covered in
elementary statistical texts, is the problem posed by sampled units that do
not choose to participate. For example, in the Compensatory Reading Study,
731 school districts were carefully selected (in Phase I) as candidates for
the sample, but then only the first 222 who responded that they were ready to
be involved in the study were actually included (in Phase II). Another way
in which nonresponse bias can occur is through the reporting of invalid or
unusable data. The summaries of annual state reports (Wargo et al., 1972;
Gamel et al., 1975; Thomas and Pelavin, 1976) have suffered from the fact
that although reports were available for the large majority of states, most
of the reports did not provide the quantitative information needed to produce
a national summary, especially of achievement gains from Title I. That
nonresponse bias can be important for some variables and not others was shown
in the national Title I surveys of 1967-68 and 1968-69. In these surveys,
although response was good for questions of participation and service delivery,
it was completely inadequate for questions of impact on achievement: only
6% or 7% of the districts provided adequate achievement results.

This kind of sampling problem, nonresponse bias, is difficult but not
impossible to handle. The first step is to compare what data are available
from the nonresponding units to corresponding data on responding units to
test whether responding and nonresponding units are really from different
populations. If no difference is found on a variety of characteristics
related to the variables of interest, nonresponse may not contribute a great
deal of bias to the study results. Also, if nonresponse is limited to fewer
than 10% of the sampling units, as a rule of thumb, then the bias introduced is
likely to be unimportant.

Some differences between units that do and do not respond are very likely
to be observed, and a nonresponse rate of greater than 10% is frequent. There
are two solutions in this case. (1) If on various stratifications of the sample
there are at least some units in each cell who respond, then the results from
the units that respond can be weighted accordingly to stand for both themselves
and the units that did not. For example, if in stratum A of a sample of schools,
4 of 10 schools participate, and in stratum B, 8 of 10 participate, each score
in stratum A should be weighted by twice as much (the ratio of 8/10 to 4/10) as
the scores for schools in stratum B. (2) One can choose a small sample of the
nonparticipants and by intense efforts gain their participation. From these
comparisons, estimates of nonresponse bias can be obtained. Such
nonrespondent sampling and follow-up is crucial to the validity of any estimates
of population statistics when fewer than 75% of the sampled units agree to
participate and do in fact produce usable data.
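
The weighting in solution (1) can be sketched in a few lines. The stratum
counts are those of the example in the text; the function name is an
assumption, and the weight used here (sampled units divided by responding
units) is one common way to let respondents stand for nonrespondents.

```python
def nonresponse_weights(sampled, responded):
    """Weight for each stratum: sampled units divided by responding units."""
    weights = {}
    for stratum, n_sampled in sampled.items():
        n_responded = responded[stratum]
        if n_responded == 0:
            # Weighting is only possible when every cell has some respondents.
            raise ValueError(f"no respondents in stratum {stratum!r}")
        weights[stratum] = n_sampled / n_responded
    return weights


sampled = {"A": 10, "B": 10}
responded = {"A": 4, "B": 8}

weights = nonresponse_weights(sampled, responded)
ratio = weights["A"] / weights["B"]

print(weights)  # {'A': 2.5, 'B': 1.25}
print(ratio)    # 2.0
```

Each responding school in stratum A stands for 2.5 sampled schools and each in
stratum B for 1.25, which reproduces the two-to-one ratio given in the text.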

Nonresponse bias is especially a problem for longitudinal studies. When
gains are measured from pretest to posttest, the mobility of children between
schools can substantially affect the conclusions reached: if children who
leave a particular sampled school tend to learn more slowly than those who
remain, apparent gains will be greater than if all the children were tested at
both times. It is clear that in order to provide meaningful analyses of pretest
to posttest gains, the same students must be included in both pretest and
posttest samples. This means, based on the examination of nonresponse bias
in the Compensatory Reading Study (Trismen et al., 1975), that the children
included in such analyses will tend to be substantially less educationally
disadvantaged than the totality of children participating in compensatory
education. In the Compensatory Reading Study, the choice was made to analyze
gains for instructional group means that included all children who took either
the pretest or posttest. Although that choice ensured that the most
disadvantaged children were included in the analysis, the meaningfulness of
"gains" computed between pretest and posttest groups containing different
children is highly questionable: any gains would be confounded by mobility
effects.* The only apparent solution to the mobility problem is to analyze the
data according to a more sophisticated model that treats student mobility and
other causes of nonresponse as components of the system and evaluates them as
well as achievement gains of students who take both pretests and posttests.
This would require tracking down and posttesting at least a small
representative sample of pretested students who are not present for the
posttest.
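
A small numerical sketch shows how mobility can inflate apparent gains. All
group sizes and scores below are hypothetical; the leavers' posttest mean
stands for what a follow-up of absent children might find.

```python
# Hypothetical groups: (number of children, pretest mean, posttest mean).
stayers = (80, 40.0, 50.0)
leavers = (20, 30.0, 35.0)  # posttest obtainable only by tracking children down


def pooled_mean(groups, index):
    """Mean over all children in the given groups (index 1 = pretest, 2 = posttest)."""
    total_n = sum(group[0] for group in groups)
    return sum(group[0] * group[index] for group in groups) / total_n


pretest_all = pooled_mean([stayers, leavers], 1)
posttest_all = pooled_mean([stayers, leavers], 2)

# Gain if every pretested child were also posttested.
true_gain = posttest_all - pretest_all

# Gain for children tested at both times (stayers only).
matched_gain = stayers[2] - stayers[1]

# "Gain" from group means: everyone at pretest, only those present at posttest.
group_mean_gain = stayers[2] - pretest_all

print(true_gain, matched_gain, group_mean_gain)  # 9.0 10.0 12.0
```

Under these assumed numbers, the matched analysis overstates the gain for the
full population, and the comparison of group means containing different
children overstates it further.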

There are two general recommendations that follow from the points made
in the discussion of this issue. From these, many specific recommendations
for procedures can be derived.

1. Sampling plans for evaluation should be carefully related to
the information needs to be satisfied. Nationally representative
samples are necessary only when quantitative estimates of
program operating characteristics are needed, and they may
impede the gathering of certain other types of information.

2. Plans for the analysis of data should be carefully examined
prior to sampling for their implications on sampling procedures,
and vice versa, so that the problems associated with use of
different observational and sampling units and with nonresponse
bias can be foreseen and dealt with in the context of a single
comprehensive system model. Only then can the data collected
be comfortably accepted as representative of program operation.
This recommendation goes beyond sampling and will be elaborated
in the discussion of measurement and analysis issues.

* The use of instructional group means in the Compensatory Reading Study also
suffered from the fact that the few children who were in compensatory classes in
the fall and regular reading classes in the spring (presumably because they
improved significantly) would have their pretests counted in the compensatory
group means and their posttests counted in the regular group means.

Issue 4. How large a sample is necessary?

The choice of sample size for federal social program evaluations is
largely arbitrary. One can obtain useful information from observing one
school or ten thousand. Although a quantitative methodology exists for
determining the sample size needed for an evaluation as a function of the
precision of the information needed, the need for (and, therefore, value of)
precision is nearly impossible to quantify. For example, for most
policymaking, it is immaterial whether finding that an event occurs 30% of the
time in a sample means that 19 times (samples) out of 20 the population
percentage would be between 25% and 35% or between 20% and 40%. Yet the sample
size would have to be roughly four times as large in the former case as in the
latter.
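
The factor of four can be verified with the usual normal approximation for a
proportion. This is a sketch under that assumption (simple random sampling,
19-in-20 confidence, normal deviate 1.96); the function name is illustrative.

```python
import math


def n_for_half_width(p, half_width, z=1.96):
    """Sample size so that p is estimated within +/- half_width, 19 times in 20."""
    return z ** 2 * p * (1 - p) / half_width ** 2


narrow = n_for_half_width(0.30, 0.05)  # interval of 25% to 35%
wide = n_for_half_width(0.30, 0.10)    # interval of 20% to 40%

print(math.ceil(narrow), math.ceil(wide))  # 323 81
print(round(narrow / wide, 6))             # 4.0
```

Halving the width of the interval quadruples the required sample, because the
half-width enters the formula as a square.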

In the case of compensatory education evaluations involving achievement
gains, a plausible criterion for information precision has been suggested:
that on observing a gain which is "educationally significant," a sample should
allow one to infer that at least 19 times out of 20 that gain would not be
purely by chance. This criterion depends, of course, on an acceptable
definition of educational significance. Brief discussion of the use of this
criterion to determine sample size and of the relationship between sample size
and information gathering costs is as far as the present consideration of
sample size will extend. Readers who wish further information are urged to
consult a text on survey sampling (e.g., Raj, 1968).

To determine sample size, we need to consider not only the total sample
but also the size of the groups that we want to compare. As the evaluators
of Head Start found out nearly a decade ago, it was not sufficient just to
obtain a sample of 100 Head Start programs, because it turned out that the
sample included only 30 full-year programs, as opposed to summer programs.
This did not provide a sufficient data base for statements describing the
effectiveness of the full-year programs. If the design of the program
evaluation calls for sampling in ten different categories (e.g., grades,
project treatment types), the sample size in each of these categories should be
determined so that a stable mean can be estimated for that category.

Some authors have proposed that for educational program evaluation a gain
or difference of one-third of a population standard deviation be considered
educationally significant (e.g., Horst, Tallmadge, and Wood, 1975). While
nobody claims that this criterion of educational significance is "correct," the
fact that it has been referred to repeatedly demonstrates the need for some
such criterion, and research to establish a criterion is called for. For the
purposes of sample size determination, we can use this criterion in
calculations exemplified by the following simple experimental design. Let us
assume that we want one-third of a standard deviation difference between two
groups (a Title I treatment and a control group) to be significant at the .01
level on a two-tailed test. That is, we want the likelihood of observing that
difference (or larger) by chance alone to be less than one in a hundred. This
leads by simple algebra and an assumption about the randomness of the chance
effects to an estimate of the sample size.



\[
\frac{\text{minimum difference to be detectable}}{\text{within-group standard deviation}}
\times \sqrt{\frac{\text{necessary size of each group} - 1}{2}}
= \text{critical value of } t
\]

where the critical value of t corresponds to the reliability of detection
desired, from tables of the t distribution; or reordering this equation:

\[
\text{necessary size for each group}
= 1 + 2 \left( \frac{\text{critical value of } t \times \text{within-group standard deviation}}
{\text{minimum difference to be detectable}} \right)^{2}
\]

The critical value of t corresponding to a .01 significance level is 2.58; 
so, if it is necessary to attribute a difference that is K times as large as 
the typical random variation of scores within groups, the necessary sample size 
is given by: 

\[ N = 1 + 13.3 / K^{2}. \]

If the minimum detectable difference were to be one-third of a population
standard deviation (determined from published test norm tables) and the
standard deviation within each of the two groups being compared were one-half
the population standard deviation (K = 1/3 ÷ 1/2 = 2/3), then the required
sample size would be 31 in the treatment and 31 in the comparison group.
If we were satisfied with a .05 level of significance, the necessary sample
size would be about 20 in each group. Thus, it is usually unreasonable to
expect that a teacher should be able to evaluate the effectiveness of a
compensatory reading program on the basis of his or her students in a single
class, because the class will not be large enough to allow detection of some
educationally significant differences between treatment and comparison students.
Moreover, if that teacher can clearly see such a gain, it must be quite a bit
in excess of the minimum needed to be evidence of "educational significance."
On the other hand, in school districts of moderate size or larger, there
certainly would be enough students to be able to carry out an evaluation of
their compensatory education project using the one-third standard deviation
criterion of educational significance.
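
The figures of 31 and about 20 per group follow from the relation N = 1 +
2(critical value / K)^2, where K is the minimum detectable difference divided
by the within-group standard deviation. The sketch below assumes that relation
and the critical values 2.58 and 1.96 (the large-sample values for the .01 and
.05 levels); the function name is illustrative.

```python
import math


def group_size(min_difference, within_sd, critical_value):
    """Per-group size from N = 1 + 2 * (critical_value / K)**2, K = difference / sd."""
    k = min_difference / within_sd
    return 1 + 2 * (critical_value / k) ** 2


# Difference of 1/3 and within-group sd of 1/2, both in population sd units.
n_at_01 = group_size(1 / 3, 1 / 2, critical_value=2.58)
n_at_05 = group_size(1 / 3, 1 / 2, critical_value=1.96)

print(math.ceil(n_at_01))  # 31
print(math.ceil(n_at_05))  # 19, i.e. about 20
```

Because K enters as a square, halving the difference to be detected roughly
quadruples the number of students needed in each group.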

We should remind the reader that the selection of the minimum effect to
be detectable was arbitrary and it was crucial for the calculation. Thus, it
is crucial for the final resolution of the sample size question to determine
the exact form of the comparison to be made. To take a different type of
comparison, suppose we wished to compare two different treatment groups on the
percentage of participants achieving a particular minimum proficiency level.
If we wished to be able to reliably (at the .05 level) detect any differences
in percentage of 20% or more (e.g., 50% vs. 30% or 90% vs. 70%), an estimate
of the required sample size can be obtained as:

\[
\frac{\text{percent difference to be detectable}}{\text{standard deviation of the difference}}
= \text{normal deviate corresponding to .05 level of significance}
\]

or, taking the standard deviation of the difference at its largest possible
value, \( 1/\sqrt{2N} \), where N is the size required of each group:

\[
\text{percent difference to be detectable} \times \sqrt{2N} = 1.96
\]

\[
\text{or } N = \frac{1}{2} \left( \frac{1.96}{.20} \right)^{2} = 48
\]

Any calculations of sample size are critically dependent on the needed minimum
level of reliably detectable effect. In tradeoffs with other cost dimensions, 
evaluation designers should decide with program managers what precision is 
needed in terms of the use to which the results are to be put. 
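
The figure of 48 per group can be reproduced under the assumption that the
standard deviation of the difference is taken at its largest value (both
percentages near 50%). The sketch below shows that worst case alongside the
sizes implied by the two example comparisons; the function names are
illustrative and the normal deviate 1.96 is assumed for the .05 level.

```python
def group_size_for_proportions(p1, p2, z=1.96):
    """Per-group size so that the difference p1 - p2 equals z standard errors."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return z ** 2 * variance_sum / (p1 - p2) ** 2


def worst_case_group_size(difference, z=1.96):
    """Same calculation with both variances at their maximum of 0.25."""
    return 0.5 * (z / difference) ** 2


worst = worst_case_group_size(0.20)
mid_range = group_size_for_proportions(0.50, 0.30)
high_range = group_size_for_proportions(0.90, 0.70)

print(round(worst))       # 48
print(round(mid_range))   # 44
print(round(high_range))  # 29
```

Using the worst-case value is conservative: a sample of 48 per group suffices
for a 20-point difference anywhere on the percentage scale.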

There is another aspect of sample size that must be considered. Any
evaluation of a program such as Title I is carried out over a particular
geographic and demographic area. A school district may be interested, for
example, in evaluation of the program within its district, a state within its
state, and the USOE and Congress may be concerned with evaluation across the
whole country. In each case, it is not sufficient to sample a single unit, such
as a school, even though there may be a sufficiently large number of students
in that school, because the particular attributes of that school might be quite
different from the attributes of schools across the district, the state, or
country; these differences might lead to quite different conclusions with
respect to the effectiveness of Title I, depending on which school was chosen.
Thus, the sample must include units chosen to represent the total variability
across the population, and the necessary sample size would apply to the number
of schools selected, not the total number of students tested. This implies that
costs are not merely for testing each student, but rather that costs associated
with setting up observations at each school or district, irrespective of the
number of students tested, must be included.

\[
\begin{aligned}
\text{Cost} = {} & \$A_1 (N_1)(L_1)(H_1) + \dots + \$A_K (N_K)(L_K)(H_K) \\
 & + \$B_1 (N_1)(L_1) + \dots + \$B_K (N_K)(L_K) \\
 & + \$C_1 (N_1)(L_1 - 1) + \dots + \$C_K (N_K)(L_K - 1) \\
 & + \$D_1 (H_1) + \dots + \$D_K (H_K) \\
 & + \$E
\end{aligned}
\]

Notation:

The subscript 1, ..., K refers to different classes of individuals who
must be contacted or tested during data collection, such as students, teachers,
local school administrators, and state administrators.

N refers to the number of each type of individual involved;

L refers to the number of contacts over time with each individual; and

H refers to the depth, or length, of each contact.

The costs are:

\(\$A_i\) is the cost per unit time (or depth) of collecting data from
individuals of type i, once one has contacted the individuals;

\(\$B_i\) is the cost, for each contact, of locating and reaching an
individual of type i;

\(\$C_i\) is the cost of keeping track of him/her for subsequent data
collection, in a longitudinal design;

\(\$D_i\) is the cost of preparing the contact and data gathering procedure
for individuals of type i; that is, the instrumentation cost; and

\(\$E\) is planning, management, analysis, and reporting cost.

Figure 4. A first approximation to estimation of information-gathering costs
in an evaluation.
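
The cost expression of Figure 4 can be written as a short function. The class
and field names are assumptions chosen to mirror the figure's symbols, and
every dollar amount and count in the example is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RespondentClass:
    """One class of individuals (e.g., students) in the Figure 4 cost model."""
    n: int                  # N: number of individuals of this type
    contacts: int           # L: number of contacts over time with each individual
    depth: float            # H: depth, or length, of each contact
    cost_collect: float     # $A: cost per unit depth of collecting data
    cost_reach: float       # $B: cost of locating and reaching, per contact
    cost_track: float       # $C: cost of keeping track between contacts
    cost_instrument: float  # $D: instrumentation cost per unit depth


def evaluation_cost(classes, overhead):
    """Total cost: the four per-class terms of Figure 4 plus $E (overhead)."""
    total = overhead
    for c in classes:
        total += c.cost_collect * c.n * c.contacts * c.depth
        total += c.cost_reach * c.n * c.contacts
        total += c.cost_track * c.n * (c.contacts - 1)
        total += c.cost_instrument * c.depth
    return total


# Hypothetical design: students tested twice, teachers surveyed once.
students = RespondentClass(1000, 2, 1.0, 2.0, 1.0, 3.0, 5000.0)
teachers = RespondentClass(50, 1, 0.5, 20.0, 10.0, 0.0, 2000.0)

total_cost = evaluation_cost([students, teachers], overhead=20000.0)
print(total_cost)  # 36000.0
```

The tracking term vanishes for any class contacted only once, which is why
longitudinal designs add cost beyond the repeated testing itself.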

Having established the size needed for a study, the cost of it can
roughly be estimated using a computation of the form shown in Figure 4.
Clearly that figure is an oversimplification, which can be refined dramatically
for different types of evaluation. Comparison of the cost with an estimate of
the benefits to be gained from the evaluation would provide a rational method
for deciding whether to carry out the evaluation. On the other hand, in the
real world in which the benefits from evaluation are nearly impossible to
estimate beforehand, the comparison is usually with a prespecified budget
allocation for evaluation. In the case in which the estimated cost exceeds
the budget allocation, which is the most frequent situation (at least from
the point of view of proponents of planning and objective, rational
decisionmaking), decisions must be made of which information needs should
remain unfulfilled in the study or what precision should be sacrificed.

In summary, the point of this discussion is first to demonstrate that
there are methods for determining sample size from knowledge of information
precision needs and information costs, but second, to note that the
specification of information precision needs is still only vaguely understood
in educational evaluation.

Measurement

Introduction 

Measurement refers to the process of assigning numbers to represent
constructs, objects, or events of interest. The purpose of assigning numbers
is to make it possible to aggregate and compare different events easily (e.g.,
it is easy to compare two test scores, but can be laborious to compare the
unstructured behaviors of two students in a classroom). There is an
extensive literature on the mathematical foundations of measurement of which an
expert evaluator must be knowledgeable, just as he or she must be knowledgeable
of the mathematics of experimental design, sampling, and data analysis. The
general purposes of that literature are (1) to develop new methods for
measurement and (2) to establish and delineate the meaningfulness of
conclusions based on measurements. The principle underlying the second purpose
is that measurement should not distort reality; conclusions based on
comparisons of numbers resulting from measurement should be the same as the
conclusions one would reach if the constructs, objects, or events being
measured were directly compared without assigning numbers.

The measurement issues to be discussed in this section concern the impact
of compensatory education on educational disadvantage. Knowledge of the
intricacies of cost and expenditure measurement is also of importance for
program evaluation; readers who wish to find out about these intricacies in
the context of compensatory education evaluation should read the cost analysis
report by Dienemann, Flynn, and Al-Salam (1974). The problems of testing
are the more controversial measurement issues related to compensatory education,
however, and will receive major attention here.

Achievement measurement is the central task in the evaluation of
compensatory education programs. At a recent national conference on
standardized achievement testing of disadvantaged students (Wargo and Green,
1977), Wargo noted that:

A major reason for the increased use of standardized achievement
tests in elementary and secondary education program
evaluation relates to the general thrust of school aid at those
levels. Most federal financial support programs for local
educational agencies have as one of the primary objectives, if not
their primary objective, the overcoming of educational
disadvantages suffered by students from low socioeconomic status
families or from culturally different backgrounds. The
translation of such legislative goals into program objectives
usually means a focus on improving the basic skills (reading,
writing, and mathematics) of such students. That combination
of legislative and programmatic thrust serves as a major
impetus for evaluation specialists to select off-the-shelf
standardized achievement tests for determining local,
statewide, and national education program impact. (p. 4)

Also, the U.S. Office of Education's current major efforts to provide technical
assistance to states and local districts in their Title I evaluations center
around a set of models for collecting and analyzing achievement data.

Deficiencies in the measurement of achievement have shared with
deficiencies in use of control groups (Issue 1) the major focus of controversy
surrounding evaluations of compensatory education. Other measurement issues
in Title I evaluation do not meet the political stress engendered by the fact
that certain ethnic groups tend to score lower on achievement tests than others.
Furthermore, because achievement tests are frequently used as mechanisms of
personnel selection for high-paying jobs and higher education, there is an
implicit threat in the use of achievement tests in program evaluation that the
individual's scores will somehow later be used against him/her.

The consideration of measurement issues is divided into three parts. First,
there is the problem of identifying and selecting which constructs to measure;
should one, for example, measure overall progress in "learning to read" or
should one measure component skills learned? Also, to what extent is it the
role of evaluators to measure noncognitive benefits and side-effects of program
operation? Second, there is the selection of an instrument; although that
theoretically should follow after selection of constructs to test, the usual
situation in practice is that the availability of tests determines which
constructs are tested. A very controversial aspect of the instrumentation issue
is whether or not to use criterion-referenced tests. The third issue concerns
the manner of recording of scores to be used in analysis. As such, it is on
the border between measurement and analysis issues. However, because its
controversial aspects relate to the content of test publishers' manuals rather
than to experimental design, we have included it in this section. A subtitle
for the issue, "Are grade-equivalent scores really that bad?", reflects the
focus of controversy on this issue.

The aim, as in earlier sections, is to inform the reader of the content
of the issues, to point out the critical problems, and to suggest ways in which
the issues may possibly be resolved.

Issue 5. What constructs should be measured to determine Title I impact?

Within schools in low-income areas, Title I prescribes that services are
to be provided to educationally disadvantaged children in order to "meet their
special needs." Educationally disadvantaged children have been defined as
those who are judged not to be likely to be graduated from high school (USOE,
1970; Glass, 1970), or who are judged at least a year behind the achievement
levels expected of their age group (GAO, 1975), using subjective judgments or
scores on achievement tests. Special instructional services are to be provided
to all the specified children, and special noninstructional services can be
appended to the program to supplement the instructional services. Therefore,
measures of impact must reflect the extent to which achievement levels are
improved by the program, and the constructs measured must be those that relate
to achievement. That does not imply that achievement test scores are the
only criterion for impact evaluation. In fact, children are in schools for a
dozen years or more, and achievement levels in higher grades may depend on
many factors other than achievement in the first few years of school. (1)
What factors are related to achievement? (2) Should achievement be measured
in holistic terms (e.g., can Johnny read?) or in terms of component skills?
(3) Should achievement be measured in terms of scientific theories of
achievement or in empiricist terms of "what achievement tests test?" Until
such questions are addressed, impact evaluations will suffer from charges of
"narrowness" and "superficiality" and even "irrelevance" of their outcome
measures, and therefore of their conclusions. The discussion of this issue
will focus on these three questions.

The first question, in practice, concerns the relationship between
attitude and achievement. Improving children's attitudes is viewed by many
compensatory education teachers as an important objective for their activities;
they believe that its ultimate payoff in terms of achievement may be much
greater than the learning of a few specific components of reading. The evidence
is mixed concerning that relationship, however. Shavelson, Hubner, and Stanton
(1976) cited studies that empirically support the notion that improving a
child's self-concept will lead to achievement gains. Project LONGSTEP (Coles and
Chalupsky, 1976, Vol. II) found a positive correlation between an attitude
composite and achievement scores; however, the Compensatory Reading Study
(Trismen et al., 1975) found a negative correlation. The degree of
standardization of attitude measures is as yet insufficient to allow one to
compare these different results; in any case, before attitude measures can
become acceptable indicators of ultimate achievement effects, a substantial
amount of research into the strength of that relationship - and into the ways
of enhancing the relationship - is necessary. Thus, our conclusions are (1)
that attitude measures can play only a supplementary role to achievement tests
at present for determining whether a Title I treatment is having impact on
achievement, but (2) that it is likely, when adequate research is available,
that some kinds of attitude improvement will be shown to be a reasonable
short-term goal for treatments that aim for long-term achievement gains, so
attitude measurement should not be discouraged.

Assessments of achievement in Title I evaluations have tended to focus
on reading, language arts, and mathematics. The question of whether it is
achievement in general or the mastery of particular skills related to
achievement that should be assessed in these evaluations is of concern to
specialists in each of these areas. In order to simplify discussion, we have
(like Trismen et al., 1975; GAO, 1975; and Thomas and Pelavin, 1976) chosen
reading achievement as the example from which generalizations can be made to
language arts and mathematics achievement. The second issue referred to above
is whether or not it is reasonable to assess reading achievement in terms of
specific skills (e.g., decoding, memory, inference, visual acuity, specific
vocabulary), each of which alone does not constitute the ability to read, but
that are component skills that are believed to contribute to reading
achievement. The case has frequently been made (e.g., Stearns, 1977) that
standardized tests such as the Metropolitan Achievement Test and the California
Test of Basic Skills almost completely fail to capture the content of particular
remedial or compensatory reading programs. The reason given for this failure
is that Title I teachers typically focus their efforts on specific skills that
are related to reading achievement rather than on reading achievement itself.
If the participating children have clear needs for which such intense focused
effort is warranted, which is undoubtedly the case for many, then assessment
of progress in terms of tests most of whose items require skills not addressed
by the treatment seems unfair. On the other hand, focusing on a particular
component skill may not ultimately enhance reading achievement. As with attitude
outcomes, it seems necessary to include in the evaluation of a treatment
some measure of overall reading achievement (possibly one or more years after
the treatment, which is not in conflict with the need for annual evaluations).




The third question to be addressed in this discussion is whether evaluations
should be firmly based in scientific theories of (reading) achievement or
whether they should be firmly based in empirical pragmatism, measuring what test
publishers call achievement. Of course, firm grounding in theory is
preferable - if the theory is correct. There are many theories, or models,
of the process of learning to read, however, and at least some of them must
be wrong. In fact, it is likely that there are many different ways to learn
to read, even for a single individual, so measurement would have to be in terms
of alternative theories for learning to read. Williams (1973) has reviewed models
for learning to read and lists six categories of theories: taxonomic, psychometric,
behavioral, cognitive, information processing, and linguistic. A synthesis
of the many perspectives on cognitive achievement is clearly needed as an
initial step if we are to be able to evaluate impact directly in terms of the
achievement of new cognitive skills rather than indirectly in terms of the
possible use of those cognitive skills to answer questions on an "achievement
test." It should be pointed out, in fairness to the developers of commercial
tests, that many of them have, especially in recent times, attempted to select
items for tests in such a way that scores for particular subscales of items can
be interpreted in terms of specific skill mastery.*

The value of a firm grounding of compensatory education evaluation in the
theory of cognitive achievement should be clear. Such controversies as to
whether students participating in a Title I treatment should be expected to
learn 70% as much as the median student in a particular time period, or 90%
or 110%, are based on a lack of knowledge of just what types of skills should
be learned and are being learned by individuals who at the beginning of
treatment have some other particular set of skills. In terms of an adequate
theory, an individual child's level of achievement could be characterized
either as the constellation of skills that he or she has acquired or, for the
purpose of summarization, as the proportion he or she has completed of the total
learning effort needed to reach an ultimate achievement goal. Although the
research needed in order to implement this approach is quite substantial, it
would appear to involve no scientific procedures that are not presently feasible.

To summarize our conclusions concerning the selection of constructs to
measure in evaluating Title I impact, (1) it appears reasonable to use
attitude and other noncognitive measures as supplements to achievement measures,
although substantial further research on the relationship between cognitive and
noncognitive measures is needed; (2) the same conclusion holds for component
skill measures as for attitude measures: they should be supplements to
overall achievement measures; and (3) evaluation will be much more useful
when based on a scientific theory of cognitive achievement; however, the research
to develop a sufficient theoretical framework is substantial. All three of
these conclusions are similar in their ambivalence: what we have now is
minimally adequate, but with some research into the processes that Title I
is intended to affect, a significant improvement in impact evaluation would
be possible. Until that research is undertaken, skeptics of evaluation will
have reasonable arguments that the use of any particular measurement instrument
yields results that too narrowly define the purpose of Title I, or that are
irrelevant to the goals of particular Title I treatments, or that are too
superficial to capture the essential impact of a treatment.

Issue 6: What types of achievement measurement instruments should be used in
Title I evaluation?

There appears to be no reasonable and efficient alternative for measuring
program impact on a student's achievement level to requiring him/her to
produce answers on a paper-and-pencil test. There are literally thousands of
alternative tests, and any teacher may construct a new test to fit any
occasion. The major alternatives for test selection are (1) between a locally
developed test and a standardized test and (2) between a criterion-referenced
test and a non-criterion-referenced test. The choice must be made in terms of
the particular objectives of the evaluation and will reflect a tradeoff of
some values for others. For the choice between a locally developed and a
nationally standardized test, the relevant factors are: (1) the credibility
inherent in use of a test being used by many others, (2) the availability of
norm distribution tables for the standardized test, (3) the possibility of
tailoring a locally developed test to reflect local objectives and instructional
methods, (4) ease of aggregation of data across sites when standardized tests
are used, and (5) the relative costs of buying a test from a commercial
publisher and generating items locally. For small, informal evaluations, the
choices will clearly be different from the choices for a national evaluation
whose validity is likely to come under attack.

To choose between criterion-referenced tests and tests not so designed is
a matter of some controversy, primarily because of the strong arguments and
large investments on both sides. Basically, a criterion-referenced test is
one "that is deliberately constructed to yield measurements that are directly
interpretable in terms of specified performance standards" (Glaser & Nitko,
1971, p. 653) or one whose score "has some sort of meaning in itself, irrespective
of the scores for specified groups" (Shaycoft, 1976). Items on criterion-
referenced tests are systematically derived from a set of objectives or
rationales to be measured rather than by statistical item analysis of a large
item pool. Until quite recently, commercial tests were not developed to be
criterion-referenced.* Instead, to provide meaning to raw scores, tables
were provided showing what percentage of the population achieved each raw score
level; that is, the tests were norm-referenced. Note that the concepts of
criterion-reference and norm-reference are not per se incompatible (test scores
can have both absolute and relative interpretations); however, the methods of
developing the tests are quite different. Norm-referenced tests are developed
to be sensitive to individual differences among students, whereas criterion-
referenced tests are developed to be sensitive to degrees of skill attainment
for each individual.

* That is not to say that good commercial norm-referenced tests have not been
designed to contain items whose rationales are that right answers to them
indicate the achievement of particular skills (see, for example, Flanagan, 1951).

The relevant factors for choosing between standardized tests that are
norm-referenced or criterion-referenced are: (1) the relevance of the
content of the test, of whichever type, to achievement constructs being measured;
(2) the type of evaluation comparison being made (see Issue 1); (3) the volume
of data desired; and (4) cost and availability. For the informal local
evaluation (e.g., weekly progress quiz), a teacher is well advised to emulate
the principles of criterion-referenced test development rather than deliberately
selecting items likely to demonstrate different levels of achievement among
students. The choice for large-scale evaluations is more difficult.

In order to clarify the selection problem, we shall consider various
arguments for and against, first, norm-referenced and then criterion-referenced
tests.

Norm-referenced tests are sets of items, the distribution of responses
to which is known for a sample representative of some population. They offer
both the advantage of enabling test scores to be interpreted in terms of
comparisons to the population and the advantage of credibility, in that they
were not developed by the individual who teaches the knowledge and skills. The
criticisms of norm-referenced tests deal almost exclusively either with the
appropriateness of the norming process or with the method of selection of
item contents to include in the test. The norming problems may be solvable with
sufficient funds, because they stem from incompleteness of the data on which
norm tables are based; however, the problems with item selection suggest the
need for new kinds of tests.

There are eight specific categories of problems with norm-referenced tests;
they do not necessarily all apply to all norm-referenced tests, but they do
apply to many. After listing the eight we shall discuss them in detail.

1. Norms are based on a population different from that for which
   compensatory education is intended.
2. Norms are not longitudinal, so norms for gains are not directly
   attainable.
3. Norms exist for only one or two testing dates per grade.
4. Articulation of scores between levels is not well validated.
5. Performance is not criterion-referenced for component skills
   (although major publishers are moving to accommodate this need).
6. Items are developed to discriminate among individuals, not programs.
7. Items are developed primarily to discriminate performance levels of the
   majority of typical children, so the items may not be as sensitive
   to the patterns of learning of educationally disadvantaged children.
8. Test scores have a smaller error component near the ceiling than near the
   floor of performance on each form.

The first four problems obviously could be solved by extension of the
norming process. Are they important, however? The following are some of the
distortions of results that have been suggested to result from these problems.
The first problem is that the particular sample being tested in a compensatory
education evaluation is not the same as a distribution of children in the norm
population with the same scores. For example, in the norm population,
extremely low scores may be indicative of some permanent or transient learning
disabilities that are predictive of certain learning paths, whereas those low
scores in ghetto schools may be the result of environmental pressures. Even
though some Title I participants will have been included in the norm groups,
they will be a minority of the low scorers because of the economic criteria
for Title I funding. Thus, for example, among students at the 20th percentile
at the beginning of third grade, those that are likely to be selected for
compensatory education treatments (e.g., from low economic status families) may
be those that by the end of third grade tend to move toward the 16th percentile
while others move upward (or vice versa).

This leads to the second need, for longitudinal norms. This need is
clear when we consider that students are geographically mobile, as well as
dropping out at the upper grades. Thus, norming must take into account student
mobility, or else the achievement of the population will appear to be different
from (usually greater than) its actual value. More important, perhaps, is the
fact that in a pretest-posttest evaluation design, children taking the
posttest will have had prior experience (the pretest) on another form of the test,
which experience the members of the norm group lacked.

The third problem, that norming is only carried out for one or two dates
in a school year, makes it difficult to measure the effectiveness of treatments
over intervals other than between appropriate testing dates. One solution used
in practice is linear interpolation or extrapolation: if, for example, the
norms are for a seven-month interval, but the pretest and posttest are given
six months apart, scores are transformed to grade-equivalents (that is, to a
grade level for which the score would be the median), and the resulting gain is
then multiplied by 7/6 to estimate what the gain would have been for seven
months, so that the scores can be compared with other treatments.
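
The linear rescaling just described can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name and
the sample grade-equivalent scores (2.8 and 3.4) are hypothetical and do not
come from any study cited here, and the sketch assumes, as the practice
itself does, that gains accrue at a constant rate over the interval.

```python
def rescale_gain(pre_ge, post_ge, months_observed, months_target):
    """Linearly rescale a grade-equivalent gain to a target interval.

    Assumes growth is constant over time, which is the (questionable)
    premise of linear interpolation or extrapolation.
    """
    if months_observed <= 0:
        raise ValueError("months_observed must be positive")
    gain = post_ge - pre_ge
    return gain * months_target / months_observed


# Hypothetical student: pretest 2.8, posttest 3.4, tested six months apart,
# rescaled to the seven-month interval on which the norms are based.
projected = rescale_gain(2.8, 3.4, months_observed=6, months_target=7)
print(round(projected, 2))
```

The rescaled value is only as trustworthy as the constant-rate assumption.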

A second solution, provided by some test publishers, consists of growth
curves obtained by curve-fitting procedures. The curves can be graphically
used to interpolate or extrapolate gains, assuming the validity of the curve-
fitting process.

The fourth problem, articulation of levels, arises because norm-referenced
tests come with multiple levels, each designed for a particular range of grade
levels. For many evaluations, it may be necessary to employ different levels
for pretest and posttest to avoid floor or ceiling effects. To estimate the
gain between pretest and posttest, it is necessary to convert the pretest and
posttest scores to a common scale for comparison. Tables for that conversion
are normally provided by test publishers; however, the empirical basis for
arriving at the tables is usually limited. For example, a raw score of 50 on
level A may correspond to a raw score of 20 on level B for a sample of
beginning fourth graders, but that does not imply that the same conversion would be
accurate for students at the end of fourth grade: skills learned in fourth
grade (in a particular school) might be more related to items on level A than
on level B, or vice versa.

The other four problems, relating to item selection, are more serious.
First, because the performance measured on norm-referenced tests tends to
involve unspecified combinations of many component skills, these tests are not
sensitive to the achievement of specific criteria. Thus, programs of instruction
that focus on a small set of component skills are unfairly judged using these
tests. This was discussed under Issue 5.

Another problem is that standard achievement tests have been developed to
discriminate among individuals in such a way as to be predictive over the future
of the individual. That means that they are developed not to be sensitive to
particular variations in curriculum. The main criterion for selecting an item
from a pool of reasonable items to include in a test has been its correlation
with the total score, not its correlation with an external (validation) measure
of skill attainment.

The next problem is that item development has usually included
administering items to a sample representative of the population and selecting those that
discriminate best, and as a result, items that are particularly sensitive to
the achievement of minority populations but are not as sensitive to achievement
in the majority population have been deleted on item analysis, because they
account for too little variance. Test publishers have recently given specific
attention to this problem, and it may become less important in the future.

The last problem, which concerns tests consisting of multiple choice items
where guessing is permitted (and how could one prohibit it?), is that the
reliability of test scores is greater for scores in the top portion of the
distribution for any form. At the low end, guessing accounts for a large part
of the variance, while at the high end it accounts for little. This means,
among other things, that small gains will be harder to detect in the lower
region of the distribution than in the upper region. One sidelight on this
situation is that an attempt to use out-of-range testing can appear to have an
effect by itself: if disadvantaged 10th graders are given a test for 10th
graders and score at the chance level, they might appear to be three years
behind; if given the form of the test designed for 9th graders, as more
appropriate, they might also score near the chance level, so that their scores
would appear to be three years behind the 9th graders, or four years behind
their actual grade level. Thus, changing forms can increase (or decrease)
the apparent deficit of a student by a year or more. A solution to this
problem, for the evaluator, is to select a test on which each student will
score in the mid-range. To do this for typically heterogeneous groups of
students would require a test made up of several articulated levels and
administration that required flexible starting points for individuals of grossly
different achievement levels.
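
To give a rough sense of how much of a low score can be produced by guessing
alone, the following Python sketch computes the expected score and its
standard deviation for a student who guesses blindly on every item. It
assumes a simple binomial model (independent guesses, equally attractive
options), which is an idealization introduced here and not a claim made in
the studies cited; the 40-item, four-option test is hypothetical.

```python
import math


def chance_score_stats(n_items, n_choices):
    """Mean and standard deviation of the raw score under blind guessing.

    Binomial model: each item is answered correctly with probability
    1 / n_choices, independently of the other items.
    """
    p = 1.0 / n_choices
    mean = n_items * p
    sd = math.sqrt(n_items * p * (1.0 - p))
    return mean, sd


# Hypothetical test: 40 items, 4 answer options per item.
mean, sd = chance_score_stats(40, 4)
print(f"chance mean = {mean:.1f}, chance SD = {sd:.2f}")
```

Under these assumptions a raw-score gain of two or three items near the
chance level is within one standard deviation of pure guessing, which is why
small gains at the floor of a form are hard to detect.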

The problems of measurement via norm-referenced tests are most serious
when the tests are used for relative comparisons between a treatment and a
nonrandom, unmatched comparison group or between a treatment group and a norm
population. For absolute comparisons and for relative comparisons between
randomly assigned treatment and control groups, the problems are not so
serious. The reason, in the latter case, is that relative comparison in
a randomized design does not depend on norms, and problems of item selection
will apply equally to treatment and control students.

It would seem at this juncture that there is need for some test
development activity targeted at the needs of program evaluation. Because this is
expensive, the private sector of the test development system will probably
be very inquisitive about the market for such evaluative uses of tests in
their plans for test development.

Criterion-referenced tests are sets of items clustered around sets of
objectives, or component skills, whose mastery is supposed to be equivalent
to correct item responses. In the ideal case, items are selected on the basis
that they discriminate perfectly between groups of students possessing a
skill and groups not possessing it. In cases of skills involving
incremental mastery of a large domain, such as vocabulary, measurement of the
objective may be more complex than merely mastery or nonmastery, but may involve,
for example, percentage of the domain acquired.

The problems with criterion-referenced tests are primarily in the area
of availability and cost. Because the concept has been implemented more
recently than norm-referenced tests, fewer criterion-referenced tests of high
quality are available. Given this situation, evaluators are tempted to use
well-known and long trusted norm-referenced tests. For some forms of
evaluation design, such as comparisons with a population standard, the value of
criterion-referencing is not readily apparent. In general, however, the
arguments for increased use of criterion-referenced tests in evaluation
appear fairly strong. In particular, the ability of these tests to detect
component skill acquisition addresses the complaint of some teachers (Stearns,
1977) that standardized tests are relatively insensitive to the learning of
a few component skills.

Several of the strengths of criterion-referenced tests do carry along
corresponding problems, when viewed from a critical perspective, as in the
presentation by Kosecoff and Fink (1976). For example, to be fair in
evaluation of a program, the correct objectives to be tested must be specified by
the teacher, and error in matching tested objectives to instructional
objectives will diminish the test's sensitivity to the treatment. Thus, an
evaluation will be biased by the teacher's degree of ability to match
objectives. As another example, because different treatments have different
objectives, aggregation of scores is more difficult than when a single total
score is obtained. If different treatments have different objectives, then
comparisons of the treatments on a criterion-referenced basis would have to
be a two-stage process: comparison of the extent to which each treatment
met its objective and also comparison between the objectives. A treatment
that failed to meet stringent objectives might be superior to one that
succeeded in meeting easy objectives. Third, criterion-referenced testing "would
generate information about an enormous number of objectives, thus complicating
the management, analysis, and reporting of data" (Kosecoff & Fink, 1976,
p. 2-35). The production of too much information during an evaluation is a
questionable basis for criticism; given modern computer methods for data
management and analysis, the added complexity, which corresponds to the greatest
strength of criterion-referenced tests, their sensitivity, would be welcomed
by many users of evaluation results.

In conclusion, the selection of an instrument for measuring achievement
in evaluations of Title I is dependent on the particular information needs to
be satisfied and the constructs selected for measurement. Nationally
standardized (norm-referenced) tests have the advantage of greater credibility
than locally developed tests, but they have the two disadvantages of (1)
encouraging evaluation in terms of comparison of local performance against
inappropriate norms and (2) measuring program performance in terms of tests
designed to assess overall individual differences in achievement and thus
insensitive to many dimensions of treatment effects. Criterion-referenced
tests have the advantage of producing substantially more detailed and
precise information on the performance of each treatment in terms of its own
objectives, but they have the disadvantage that, for the purposes of valid
aggregation of results across treatments with different objectives, fairly
complex interpretations of the results are necessary.

To the extent that major publishers move to compute norms for criterion-
referenced tests and to identify particular component skills that subsets of
items on their norm-referenced tests assess (as appears to be the case), this
distinction becomes less important: one could select a good norm-and-
criterion-referenced test and interpret the results to fit the particular
information needs.

Issue 7: What units of measurement should be used, or: Are grade-equivalent
scores really that bad?

This issue concerns the first step in summarization of results from
testing: should each student's score be entered into analysis as a raw score,
or should some transformation of that score be made first? The problem is not
one of cost to the evaluator, at least when using transformations for which
tables or formulas are available, but rather one of validity versus
communicability; the more technically correct units are not necessarily those that are
easiest to understand or directly relevant to decisionmaking. The resolution
of this issue clearly must treat validity as fundamental and strive for maximal
communicability among the technically correct units. Communicating wrong
conclusions very clearly is worse than no communication at all.

One particular unit that has held widespread popularity but whose technical
problems have made it notorious is the "grade-equivalent score." In several
major evaluation studies (Wargo et al., 1972; Briggs, 1973; Gamel et al., 1975;
GAO, 1975; Thomas and Pelavin, 1976), these scores were used because many state
or local evaluations were being aggregated, and the units most frequently
reported were grade-equivalents. In most cases, the authors expressed regret
at that fact. To deal with this controversy, we shall focus the bulk of our
discussion on that unit, pointing out that various of its problems are shared
by one or more of its alternatives. This is feasible because, with one or two
exceptions, any technical problem with any unit is also a problem for grade-
equivalent scores. The strength of grade-equivalent scores lies mainly in the
clear meaning they purportedly convey: a student with a grade-equivalent score
of, say, 3.5 is apparently at the level of the median student with 5 months'
instruction in the third grade; if that score were obtained by a student five
months through fourth grade, then the student would apparently be one year
behind the national norm for his/her classmates.

The seven major alternatives for measurement units are:

1. raw scores: number of items answered correctly;

2. corrected scores: raw scores corrected for guessing so that a score
   of zero corresponds to pure guessing, as shown below for a test
   consisting of items each with k possible answers:

       CORRECTED SCORE = NUMBER RIGHT - (NUMBER WRONG) / (k - 1);

   the proper correction for guessing does not count as WRONG those items
   for which no response is made;

3. whether a skill is mastered: a dichotomous 1 or 0 score indicating
   whether the student has or has not mastered the skill according to
   the test;

4. percentiles: percentage of a peer population (national, regional,
   local, or any other population deemed appropriate for comparison)
   that would have achieved raw scores lower than the student;

5. grade-equivalents: the number of school years of experience at
   which the raw score is the median, anchored at 1.0 for the beginning
   of first grade and altered by attributing one month's schooling to
   the summer quarter so that there are 10 school months per year, to
   simplify communication; between dates of actual test norm data
   collection, estimated median scores are obtained by curve-fitting
   procedures;

6. normalized standard scores or normal curve equivalents: transformation
   of percentiles to normal deviates (in particular, but not
   necessarily,* so that the mean score is 50 and so that 99% of the
   scores are less than 99); and

7. growth scale scores: a transformation of normalized standard
   scores on different test levels (grade levels) to a common metric,
   so that a student's growth can be plotted continuously across levels
   of a test.
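
Three of the units listed above are simple enough to sketch in Python. The
sketch below is illustrative only: the function names and sample values are
hypothetical, the percentile function uses a small made-up norm sample rather
than a publisher's norm table, and the constant 21.06 is the conventional
scaling factor for normal curve equivalents, assumed here because it is not
stated in the text (it is consistent with a mean of 50 and with the 99th
percentile falling just under 99).

```python
from statistics import NormalDist


def corrected_score(n_right, n_wrong, k):
    """Raw score corrected for guessing on items with k answer options.

    Omitted items must not be counted in n_wrong.
    """
    return n_right - n_wrong / (k - 1)


def percentile_rank(raw, norm_sample):
    """Percentage of the norm sample scoring strictly below `raw`."""
    below = sum(1 for score in norm_sample if score < raw)
    return 100.0 * below / len(norm_sample)


def normal_curve_equivalent(percentile):
    """Map a percentile (strictly between 0 and 100) to a normalized score.

    Uses the conventional scaling: mean 50, 21.06 points per normal deviate.
    """
    z = NormalDist().inv_cdf(percentile / 100.0)
    return 50.0 + 21.06 * z


# A blind guesser on a 40-item, 4-option test expects 10 right and 30 wrong.
print(corrected_score(10, 30, 4))
# Hypothetical norm sample of ten raw scores.
print(percentile_rank(5, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
print(round(normal_curve_equivalent(99), 1))
```

Note that the normalizing transformation stretches the tails: equal
percentile differences near the extremes correspond to larger differences in
normalized units than the same percentile differences near the median.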

No matter which of these measures is used, questions of how to compare
pretest and posttest scores or scores between groups remain. These are
discussed under Issue 8. We now turn to the specific problems of grade
equivalents and their competitors.

* The term normal curve equivalents was developed by RMC Research Corporation
and refers to the specific transformation mentioned. The more general concept
is referred to as normalized standard scores.

It is common to report a student's achievement as equivalent to the median
performance of students at a particular grade level. Thus, for example, a
student halfway through the fourth grade who was having great difficulty might
be described as "a year behind." This is a metric that is apparently
independent of any test, of any particular curriculum, and of any particular
norm group. Moreover, it suggests to a parent the amount of effort needed to
bring the student "up to standard." Even though we may criticize the
properties of grade-equivalent scores for program evaluation, they serve a distinct
purpose for communication of a student's or a class's average achievement in a
school year. Thus, test publishers include tables of grade equivalents for the
raw scores on their tests. None of the other units have the same clarity and
simplicity of meaning, although for two of the units the meaning is fairly
direct: percentiles indicate an individual's rank relative to a peer group,
and because it is that peer group with whom he/she will be competing throughout
life for the best jobs and highest quality of life, "getting behind" and "getting
ahead" in percentile terms are meaningful; and indicators of particular skill
mastery are directly meaningful to the extent that the skills mastered are
directly meaningful (however, some theoretically meaningful skills, such as
"decoding" or the Piagetian concept of "conservation," may not be obviously
relevant objectives for basic skills instruction for some audiences).

The problems of grade-equivalent scores, as well as other units, stem
both from their definition and from their operationalization. The problems
stemming from operationalization could presumably be solved with a sufficient
expenditure of funds, if the fundamental problems with the concept were not
serious. The fundamental problems for grade-equivalent scores derive from the
facts (1) that achievement gains are not linear as a function of months in
school; (2) that the summer period presents special problems; and (3) that the
performance of a student a year below grade level is qualitatively different
from that of the median student a year younger. The operational problems arise
from the fact that norms for standardized tests are published for a single
testing time in the school year, or at most two times, so that grade
equivalents for most testing dates must be arrived at by interpolation.

The fact that achievement is not linear as a function of time can produce
distorted results. In the Thomas and Pelavin study (1976), for example, larger
average grade-equivalent gains were reported for compensatory education
programs in high school than in the primary grades. Although Thomas and Pelavin
did not interpret this effect as meaningful, others might. However, that
effect is probably an artifact because, for example, an individual at the 20th
percentile might be a half year below grade level in second grade but three
years below grade level in tenth grade, so bringing him/her up to the median in
a year (unlikely, but taken for simplicity), a gain of 30 percentile points in
either case, would show a 1.5 month-per-month gain for the second grader but a
4.0 month-per-month gain for the tenth grader. At another level, learning a
specific number of component skills may lead to a 20-percentile gain at one
grade level and a 30-percentile gain at another grade level.
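
The arithmetic behind the 1.5 and 4.0 figures can be made explicit with a
short Python sketch. The function name is hypothetical; the deficits of 0.5
and 3.0 years are the ones used in the example above. A student who starts
some number of years below grade level and reaches the median one year later
has gained that deficit plus the one year of normal growth.

```python
def months_per_month_gain(deficit_years, years_elapsed=1.0):
    """Grade-equivalent gain rate for a student who starts `deficit_years`
    below grade level and reaches the median after `years_elapsed` years.
    """
    if years_elapsed <= 0:
        raise ValueError("years_elapsed must be positive")
    # Gain = normal growth over the period plus the deficit made up.
    gain = years_elapsed + deficit_years
    return gain / years_elapsed


second_grader = months_per_month_gain(0.5)  # half a year behind
tenth_grader = months_per_month_gain(3.0)   # three years behind
print(second_grader, tenth_grader)
```

The same 30-point percentile gain thus yields very different
grade-equivalent gain rates purely because the grade-equivalent distance
between the 20th and 50th percentiles widens in the upper grades.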

The second problem concerns the summer. The lesser problem with the
summer is its definition as a single month for the construction of grade
equivalents, so that, added to the presumed nine-month school year, it produces
a ten-month year in which decimal tenths correspond to months. This clever aid
to communication has the unfortunate consequence that grade-equivalents can
never be considered quite adequate for use in research on achievement growth
patterns because the summer "month" is ill-defined. The more serious problem
is that students who are achieving at levels lower than their peers may actually
lose ground, in absolute terms, over the summer (that is, they actually have
mastery over fewer academic skills at the end of the summer than they had at the
beginning of the summer), while the brightest students may gain at a rate at
least as great as, and often surpassing, their rate of gains during the school
year. (Although this result has not been proven, reports by Kaskowitz and Norwood,
1977, and Pelavin and David, 1977, are highly suggestive.) The result of this
difference in students' forgetting and extracurricular learning is to make
school-year compensatory education programs seem to have only short-range
effects: when measured from fall to the following spring, compensatory
education students show strong gains, but the students in the programs year after
year may fall further behind their peers. This problem is not merely a problem
with grade-equivalent scores but, indeed, with the underlying assumptions of
compensatory education, and the issue is discussed further in the synthesis of
substantive findings on Title I. However, it causes critical problems for the
use of grade-equivalent scores and especially distorts any studies that
aggregate results from fall-to-fall (or spring-to-spring) tests with results from
fall-to-spring tests.

The third fundamental problem with grade equivalents, and with other
scores based on a national norm sample (percentiles, normalized scores, and
growth scale scores), concerns the multidimensionality of achievement growth.
The assumption implicit in the use of grade equivalents, although not necessary
for their construction, is that there is a certain amount to be learned in each
grade. In each region of the country and in each classroom, however, particular
goals are set that are different, to a greater or lesser extent, from the goals
assessed in standardized tests. Among other things, children start school at
different ages and have different numbers of school days per year in different
states. Furthermore, the amount a fourth grader who is a year behind knows is
likely to be qualitatively different from the amount a third grader knows,
although their total test scores may be the same. The use of grade-equivalents
promotes a simplistic, unidimensional view of achievement. That simplicity
must not get in the way of discovery of particular achievements and deficiencies
in student and program performance.

A special operational problem for grade-equivalents is that they are based
only on data collected at one or two points in the school year. If tests
actually are given in an evaluation at testing times other than those for which
norming was done, interpolations must be performed to obtain grade-equivalent
gains. Thus, if the norms are for September 20 and May 20, eight months apart,
and testing is done on October 5 and May 5, seven months apart, evaluators must
multiply gains obtained by 8/7 to compare them with gains occurring in the norm
group. The possible distortions caused by such interpolations are so great that
test publishers and evaluators have called for all testing to be conducted at
the same time in the school year as the norm group was tested. Thus the use of
tests with only single norming dates (e.g., in the spring) in evaluations based
on fall to spring gains is highly questionable.
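
The 8/7 adjustment amounts to a linear rescaling, sketched below. The function name and the sample gain of 0.7 grade-equivalent units are hypothetical; the sketch assumes that growth is linear over the interval, which is precisely the assumption the preceding paragraphs call into question.

```python
from fractions import Fraction


def rescale_gain(observed_gain, test_interval_months, norm_interval_months):
    """Stretch an observed gain to the length of the norming interval.

    Assumes achievement grows linearly in time between the two testing dates.
    """
    return observed_gain * Fraction(norm_interval_months, test_interval_months)


# Norms: September 20 to May 20 (8 months); testing: October 5 to May 5 (7 months).
observed = Fraction(7, 10)  # hypothetical gain of 0.7 grade-equivalent units
scaled = rescale_gain(observed, test_interval_months=7, norm_interval_months=8)
print(float(scaled))
```

Any departure from linear growth within the year enters the rescaled gain directly as error.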

The fact that grade-equivalents are based on the performance of average
students makes them less useful for studies of students who deviate
substantially from the average (e.g., compensatory education participants). It
would be preferable to establish expected per-year, or per-month, achievement of
students in various percentile ranges, based on longitudinal norming. Then
month-for-month gains could be reported for compensatory education students in
comparison with students of comparable prior achievement levels.

For raw scores, the fundamental problem is interpretability. The only
real meaning for a raw score is its comparison with some other raw score on
the same test. If that comparison is the goal of the evaluation, then raw
scores may be the most appropriate unit. Raw scores are not guaranteed to
have a normal distribution, however, which is required by many procedures;
normalized standard scores or normal curve equivalents at least partially
solve that problem. (One should note, however, that transforming both pretest
and posttest scores to normally distributed scores definitely does not ensure
that the resulting bivariate [two-dimensional] scatter plot of scores will
conform to the bivariate normal distribution required for some analyses, such
as analysis of covariance.)

Correcting raw scores for guessing improves their accuracy by eliminating
any biases that might be due to greater tendencies to guess in some groups.
Note that this correction for guessing requires that two raw scores be obtained
for each test: the number right and the number attempted but wrong. Similar,
but more sophisticated, test scoring procedures have been suggested in the
psychometric literature and involve giving a differential fractional score to
each of the wrong answers, reflecting the amount of achievement necessary to
choose that particular wrong answer; some answers are more clearly wrong than
others to a student with partial knowledge. Such scoring has yet to be applied
to real evaluation settings, but it will provide greater sensitivity within the
particular testing time limits when it becomes feasible.
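
As an illustration, the classical correction subtracts from the number right a fraction of the number wrong, R - W/(k - 1), where k is the number of answer choices per item. The text does not state which formula the authors had in mind, so the sketch below is an assumption; the function name and the two hypothetical students are illustrative only.

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Classical correction for guessing: R - W / (k - 1).

    Omitted items are neither right nor wrong and do not enter the formula.
    """
    if num_choices < 2:
        raise ValueError("items need at least two answer choices")
    return num_right - num_wrong / (num_choices - 1)


# Two hypothetical students who each know 20 of 40 four-choice items.
# The cautious one omits the other 20; the guesser answers them blindly and,
# on average, gets 5 right and 15 wrong.
cautious = corrected_score(num_right=20, num_wrong=0, num_choices=4)
guesser = corrected_score(num_right=25, num_wrong=15, num_choices=4)
print(cautious, guesser)
```

Uncorrected, the guesser would score 25 against the cautious student's 20; the correction removes that difference in expectation, though not the added random error that guessing introduces.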

The primary problem with use of a dichotomous mastery score for each
section of a test is that it still leaves unspecified the procedures for
summarizing each individual's performance as a single score. The alternative
to a single score for each individual, of using instead a multidimensional set
of mastery scores for each individual, would necessarily require multivariate
statistical procedures in an evaluation, which somewhat increase the
computational costs of data analysis and require substantially greater
expertise on the part of evaluation data analysts.

Normalized standard scores and percentile scores are conceptually quite
similar: they both are obtained as transformations of raw scores to a
symmetric distribution. In the case of normalized standard scores, the results
are normally distributed; in the case of percentile scores, they are uniformly
distributed (that is, in the norm population, the same number of individuals
receive each different percentile score). The valid reason that evaluators
prefer normalized standard scores over percentiles relates to the validity
of using them in standard statistical data analysis procedures. Analysis of
variance and all of its variants depend on normality of scores, and percentile
scores deviate from normality sufficiently to distort the conclusions reached
from the analyses. Occasionally, the argument is heard that normalized scores
are "equal interval" scores, meaning that the difference between a normalized
score of 10 and 20 is the "same" as the difference between a score of 20 and
30, and that percentile scores are not "equal interval" scores. The grounds
for this argument are extremely tenuous. First, there is one sense in which
percentiles are equal interval scores: the differences between the 10th and
20th percentiles and between the 20th and 30th percentiles both represent 10%
of the population. Second, the claim that normalized scores are equal interval
scores is based on the theory that the achievement test is measuring some
underlying factor in the individual that is normally distributed. This theory
is, in fact, plausible because of the central limit theorem, which can be
paraphrased as saying that anything (e.g., reading achievement) that is the
sum of many independent random component factors will tend to be approximately
normally distributed. However, the theory that the underlying factor being
measured is normally distributed is only plausible, not proven; therefore, any
claim that a gain in a normalized achievement score from 10 to 20 represents
an equal amount of learning as a gain from 20 to 30 should be disregarded.
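
The relation between the two kinds of scores can be sketched with the normal curve equivalent (NCE), which rescales the normal deviate z as 50 + 21.06z so that NCEs of 1, 50, and 99 coincide with the same percentiles. The helper names below are hypothetical, and the sketch assumes the percentile is taken from the norm population.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_to_z(percentile):
    """Normal deviate corresponding to a percentile rank (0 < percentile < 100)."""
    return STANDARD_NORMAL.inv_cdf(percentile / 100.0)


def percentile_to_nce(percentile):
    """Normal curve equivalent: 50 + 21.06 * z."""
    return 50.0 + 21.06 * percentile_to_z(percentile)


for p in (10, 20, 30, 50):
    print(p, round(percentile_to_z(p), 3), round(percentile_to_nce(p), 1))

step_10_to_20 = percentile_to_nce(20) - percentile_to_nce(10)
step_20_to_30 = percentile_to_nce(30) - percentile_to_nce(20)
```

Equal steps in percentile rank correspond to unequal steps on the normalized scale (the step from the 10th to the 20th percentile is larger than the step from the 20th to the 30th), so each scale is "equal interval" only in its own sense.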

Finally, growth scale scores are similar to normalized standard scores
except that growth scale scores add the capability of comparison
across different levels of a test. Test publishers produce growth scale scores
by giving two adjacent levels of a test to the same or a matched set of students
to determine which (normalized) score on one level of the test is equivalent to
each (normalized) score on the other level. Using this method, a single scale
of achievement can be constructed that ranges from first grade through high
school.

Of the several methods of assigning numbers to test performance discussed
in this issue, some are clearly preferable to others. First, correction for
guessing is essential to remove biases engendered by differential tendencies
to guess. No matter how explicit the instructions on guessing are (and they
are frequently vague), different kinds of children and children in classrooms
with teachers of different personality characteristics are going to exhibit
different tendencies to guess.

Second, as long as norm-referenced interpretations are to be made or any
comparisons involving forms of analysis of variance are to be performed, the
scores should be transformed to normally distributed scores (normalized
standard scores, normal curve equivalents, or growth scale scores) before entry
into analysis.

Third, careful consideration should be given to the use of multivariate
analyses of mastery scores for component skills assessed by tests. Using such
analyses, it would be possible to go beyond merely concluding that one group
learned more than another to reach conclusions about what types of skills were
most effectively learned through different treatments.

Finally, grade-equivalent scores should be avoided whenever possible.



Analysis

Introduction

The three issues discussed in this section concern the process of
transformation of measurements on Title I projects and participants into
information relevant to decision rationales. Frequently this is the weakest link
in evaluation and therefore a target for challenging a study's usefulness.
Establishing the link depends crucially on the identification of research
questions or hypotheses for which (1) there are methods, based on tenable
assumptions, for deriving answers to the questions from the data, and (2)
policy implications of the answers can be deduced in a clear and logical
manner.

From a simple point of view, these issues concern the avoidance of pitfalls
that can render well-collected data valueless. From a more sophisticated
point of view, they concern pitfalls in the overall design of an evaluation.
Proper foresight in study design and data collection is needed to
prepare for "airtight" analyses and interpretations. Frequently, the key
element can be whether the data collection had included a particular item of
data that would verify an assumption needed to validate a chosen analysis,
so consideration of data analysis prior to development of questionnaires is
essential for valid evaluation.


The three issues discussed in this section concern problems that arise
when ideal evaluation designs, including random assignment to treatment and
control conditions, are infeasible or are otherwise not implemented. These
problems can be dealt with in an ad hoc fashion for each evaluation, by careful
planning and use of statistical expertise; the purpose of the discussions
in this section will be both to point out the problems and to suggest methods
appropriate for the ad hoc solutions. It is the opinion of the authors,
however, that more holistic solutions, such as changing the framework of
comparisons (as suggested under Issue 1) or finding ways to justify more
rigorous information-gathering designs, will ultimately be necessary.

The first issue (Issue 8) concerns the conditions necessary for making
inferences from a relative comparison between nonrandom treatment and control
groups. Each of the methods proposed is based on some set of assumptions,
and the discussion will attempt to estimate the reasonableness of these
assumptions and to suggest ways of testing them. The most common analytical
method used, analysis of covariance, will be described in some detail.

The second issue (Issue 9) concerns the problems that have arisen in
attempts to make inferences about the relations of treatment components and
costs to effectiveness. That type of information is the most useful
information that can be acquired for the purpose of improving the quality of
compensatory education, and yet it has usually been gathered as an adjunct to
an evaluation more concerned with some other purpose. As a result, many
conclusions concerning the relative effectiveness of different methods that
have been made in federal studies of compensatory education are highly
questionable. The discussion of this issue will attempt to identify the most
crucial threats to validity of such conclusions and to suggest ways of dealing
with those threats.

The third issue (Issue 10) concerns methods of aggregation of data.
Both the sampling units and measurement units affect the meaningfulness of
combining data across projects, and the discussion of this issue will attempt
to clarify the alternative acceptable aggregation methods and the reasons
others are unacceptable.

Issue 8. What are the conditions for valid comparisons between nonequivalent
treatment and comparison groups?

This is an important and controversial issue because there are methods
for such analyses at hand that appear at first to be valid but have been
shown on closer examination to be responsible for distortions in conclusions.
In fact, the difficulty of selecting the appropriate analysis has
been suggested as grounds for resolving the issue by avoiding comparisons
between nonequivalent treatment and comparison groups. Alternatives to
such comparisons were discussed under Issue 1. The perspective for the
following discussion concerns what to do when one must make such comparisons.
In adapting quantitative analysis methods developed for controlled experiments
into the area of quasi-experiments in the field, various assumptions
on which the methods were based have been violated, and methodologists have
recently focused a great deal of attention on ways to weaken the assumptions
and still maintain the validity of the methods.

Nonequivalent treatment and comparison groups are any pair of groups for
which it is not true that their members might have been assigned to the other
group but for a random (or pseudo-random) event. Any method of assignment,
such as matched pairs, that is not functionally random will qualify for
having the problems discussed below, but the more different the groups are,
the more substantial will the biases be that result from violated assumptions.
Basically, the purpose of a comparison group is to provide an estimate of how
well the treatment group would have performed if it had not had the special
treatment. The purpose of each of the methods discussed here is to transform
a nonequivalent control group into a group that, except for the treatment,
is identical to the treatment group, so that the comparison is possible.
This transformation is not necessary in the case of randomly assigned groups,
because any differences between such groups will be random, not biased, and
therefore they can be statistically accounted for with a high degree of validity.

There are basically four methods for "equating" nonequivalent groups,
although there are a number of variants in methods. The four methods are:

(1) matching, long denounced but recently revived by Sherwood et al. (1975);

(2) gain score analysis, also frequently derogated but recently revived by
Kenny (1975);

(3) analysis of covariance (ANOCOVA), a powerful analysis tool in experimental
psychology but problem-riddled in educational field research and evaluation;
and

(4) regression analysis.

A fifth "method," ignoring the nonequivalence, might be considered for
completeness; however, its merits are so inferior to the methods to be
discussed as to rule it out of consideration.

Many of the problems to be discussed are present with all four methods;
however, the methods are not equivalent. As background, we shall briefly
define and list the assumptions of each method.

Matching is relatively simple to describe. It consists of searching for
pairs of subjects (e.g., students, classrooms, or school districts), one in
the treatment and one in the comparison group, who are as similar as possible
on relevant dimensions, deleting all remaining subjects from the analyses to
be done, and then performing analyses (e.g., t-tests) as if the groups were
randomized pairs (as if you had selected the pairs prior to the treatment and
had randomly assigned which was to receive the treatment). The basic
assumption of this method, which has been questioned in many ways, is that the
matching is complete, meaning that there is no systematic difference remaining
between treatment and control subjects who are matched that could possibly
affect their performance. This assumption is clearly false for educational
evaluations when matching is on a single dimension: human behavior, and, in
particular, the achievement of cognitive skills, is so multiply determined that
no single measure can capture all the systematic variance among people capable
of affecting later performance. However, in a chapter in the Handbook of
Evaluation Research (Struening and Guttentag, 1975), Sherwood, Morris, and
Sherwood have investigated the reasonableness of the complete matching
assumption if one matches on a hundred or more variables simultaneously; they
found matching to be valid in the case of an evaluation study they carried out.
A problem with matching on a large number of dimensions is in finding adequate
matches. For example, if matching is on 20 dichotomous variables and 10
variables with 5 gradations of level, the number of cells in the population is
$2^{20} \times 5^{10} \approx 10$ trillion. Even if some variables are
moderately correlated, the likelihood of finding 100 matched pairs in a sample
of 10,000 treatment and 10,000 control subjects is small. The solution of
broadening the gradations (changing from 5 levels to 2 levels, for example),
even if it reduces the number of possibilities to a manageable number, is
frequently unacceptable because there can then be systematic variation within
levels. Suppose, for example, a low economic status group and an (overlapping)
high economic status group were matched on just three levels of economic status.
There would be a range of status within each of the three levels, and one
would expect that at the lowest of the three levels the subjects originally
from the low economic status group would be on the average lower than the
"matched" subjects from the high economic status group, and so forth. That
is, too coarse a match is really not a match at all.
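
The cell count in the example above can be checked directly. The sketch below is illustrative only; the function name is hypothetical, and the second call assumes that all ten five-level variables are collapsed to two levels.

```python
from math import prod


def matching_cells(levels_per_variable):
    """Number of distinct cells when matching exactly on every variable."""
    return prod(levels_per_variable)


fine = matching_cells([2] * 20 + [5] * 10)  # 20 dichotomies, 10 five-level variables
coarse = matching_cells([2] * 30)           # every variable collapsed to two levels
subjects = 10_000 + 10_000

print(fine, coarse, fine // subjects)
```

Even after the gradations are collapsed, the number of cells (about one billion) still far exceeds the 20,000 available subjects, so exact matches remain scarce unless the variables are strongly correlated.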

Although matching by itself does not appear to provide an adequate solution
to the problem of comparing nonequivalent groups, it may be useful in
conjunction with the statistical methods described below. The bias in
statistical correction procedures is least when the groups are most similar.
Whenever matching is undertaken, however, possible distortions in conclusions
resulting from matching must be considered explicitly. These distortions
generally involve some processes that would act differently to cause a
particular score on a matching variable to occur in a treatment group than in a
comparison group. See Rubin (1973, 1976a, 1976b) for further recent discussion
of matching.

Gain score analysis is similarly easy to describe: the method is to
create a derived variable ("gain") by subtracting a pretest score from the
posttest score and to perform analyses on this derived variable as if the
treatment and control groups were randomized. The basic assumption is that
pre-existing differences between treatment and control groups, as evidenced
by differences on pretests, will not be correlated with later gains. If
that assumption were true, then gain scores would be quite appropriate for
comparisons in evaluation, because they focus on the effects of the treatment.
The frequently noted fact that gain scores have greater random error
components (lower reliability) than either pretest or posttest scores is largely
immaterial for moderate- or large-scale evaluations, because increasing sample
size reduces the importance of random error components. The basic assumption
that gains are independent of pre-existing differences is, however, highly
questionable in applications to education. Gains are the result of complex
combinations of motivational and cognitive processes, and although achievement
evidenced at pretest is also dependent on such processes, subtracting the
pretest score will not remove the effects of different motivational and
cognitive levels on rate of gain between pretest and posttest. Moreover, gains
are subject to the statistical artifact that individuals with high pretest
scores will tend to have smaller gains because, for some of them, the high
pretest scores were "lucky," and conversely for individuals with low pretest
scores; that is, regression to the mean is to be expected.
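
Regression to the mean can be illustrated by a small simulation. Everything in the sketch is hypothetical: scores are assumed to be a stable true level plus independent normal measurement error on each occasion, with no treatment and no true growth, so any pattern in the gains is purely an artifact.

```python
import random
from statistics import mean


def simulate_gains(n=20000, seed=0):
    """Return (pretest, gain) pairs for students with no true change."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        true_level = rng.gauss(50, 10)
        pretest = true_level + rng.gauss(0, 5)   # "lucky" or "unlucky" error
        posttest = true_level + rng.gauss(0, 5)  # independent error at posttest
        records.append((pretest, posttest - pretest))
    return records


records = sorted(simulate_gains())  # ordered by pretest score
quarter = len(records) // 4
low_pretest_gain = mean(gain for _, gain in records[:quarter])
high_pretest_gain = mean(gain for _, gain in records[-quarter:])
print(round(low_pretest_gain, 2), round(high_pretest_gain, 2))
```

Under these assumptions the lowest-scoring quarter shows a positive average "gain" and the highest-scoring quarter a negative one, although no student's true level changed.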

The third method of interest is ANOCOVA. This method is more complicated
to describe, although it is conceptually straightforward. Basically,
the method is to focus on posttest scores and to hypothesize that the posttest
score is a sum of a number of different effects in addition to treatment
effect (usually including the level of achievement indicated by a pretest).
All the factors (called covariates) that might have effects are measured; then
the amount of effect of these factors (their beta weights or regression weights)
is estimated from the data; then all the effects due to nontreatment factors
are subtracted from each person's posttest scores; finally, the results
(residuals) are analyzed as if they were obtained from randomly assigned
treatment and control groups.

The basic assumptions of ANOCOVA are:

1. as with other methods, the assumptions needed for the analysis of
   data from randomized designs, primarily that observations on different
   subjects are independent of each other, that there is approximately
   the same possibility of random error in each individual's
   score, and that random errors are distributed approximately as the
   normal bell-shaped curve;

2. that the potency of effects of the covariates on posttest scores
   is the same in treatment and control groups;

3. that except for factors perfectly measured by the observed covariates,
   the groups are equivalent, that is, indistinguishable from a
   randomized pair of treatment and control groups; and

4. as with other methods, that the dependent variable can be assumed
   to be a linear measure of the underlying factor about which one
   wishes to draw conclusions (e.g., that a particular gain at the high
   end of a test score continuum has the same meaning as a gain of the
   same number of units at the middle and lower extremes of the curve).

The first of these four assumptions, as noted, applies to any of the
analytical methods. It is included here, however, because ANOCOVA is the only
one of the four methods that includes as an integral part what analysis is to be
done after groups are "equated."

Figure 5 is included for those who would like an algebraic description
of ANOCOVA; it may be ignored without loss of continuity in reading. Most
intermediate-level texts on experimental design (e.g., Winer, 1962) include
presentations on ANOCOVA.
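
For readers who prefer a computational description, the adjustment step of ANOCOVA with a single covariate can be sketched as follows. This is a minimal illustration, not a reproduction of Figure 5: the function name and the data are hypothetical, only the adjusted difference is computed (not the F test), and a common within-group slope is assumed, as in Assumption 2.

```python
from statistics import mean


def ancova_adjusted_difference(treat, control):
    """Return (pooled within-group slope, covariate-adjusted posttest difference).

    `treat` and `control` are lists of (pretest, posttest) pairs.
    """
    sxy = sxx = 0.0
    for group in (treat, control):
        x_bar = mean(x for x, _ in group)
        y_bar = mean(y for _, y in group)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in group)
        sxx += sum((x - x_bar) ** 2 for x, _ in group)
    slope = sxy / sxx  # regression weight of the pretest, pooled over groups
    diff_post = mean(y for _, y in treat) - mean(y for _, y in control)
    diff_pre = mean(x for x, _ in treat) - mean(x for x, _ in control)
    return slope, diff_post - slope * diff_pre


# Hypothetical error-free data: posttest = 10 + 0.8 * pretest, plus 5 if treated.
control = [(x, 10 + 0.8 * x) for x in (40, 50, 60, 70)]
treat = [(x, 10 + 0.8 * x + 5) for x in (20, 30, 40, 50)]
slope, effect = ancova_adjusted_difference(treat, control)
print(round(slope, 3), round(effect, 3))
```

With these artificial data the raw posttest means favor the control group by 11 points, but the adjusted difference recovers the 5-point treatment effect built into the example. Real data satisfy the assumptions far less neatly.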

The fourth method, residual gain score analysis, is quite similar to
analysis of covariance, and at times the two have been confused. Residual
gain score analysis consists of (1) calculating estimates of each posttest
score based on correlations with pretest scores and other covariate factors,
(2) calculating residuals by subtracting the estimates from the actual posttest
scores, and (3) performing analyses, such as analysis of variance (ANOVA),
using the residuals as the variable of interest. Werts and Linn (1970) have
shown that residual gain score analysis is based on a statistical model that
is a special case of the model underlying ANOCOVA; that is, it requires stronger
assumptions than ANOCOVA. It is a reasonable generalization, therefore, that
whenever residual gain scores are reported, statistical significance tests
should be based on true ANOCOVA, not on the application of ANOVA to the
residuals.

Of the four methods, ANOCOVA appears to be generally the best choice for
most situations. Although other methods may be appropriate for situations in
which particular assumptions are satisfied, ANOCOVA is more general. Thus, it
is with dismay that practical evaluators and educators have heard and read the
severe attacks on the method by expert methodologists. These attacks have
pointed out ways in which the assumptions might be violated in educational
evaluations and how they might distort conclusions.

The first major blow to ANOCOVA came from its use in the Head Start
evaluation. Campbell and Erlebacher (1970) pointed out problems, while Cicirelli
(1969) and Evans (1970) defended the evaluation. Campbell and Erlebacher's
presentation included graphic presentations of the way ANOCOVA, when applied
without regard to the assumptions underlying it, can systematically bias
evaluations and produce just the sort of negative conclusions that the Head
Start evaluation arrived at. The problem they identified is now but one of many
for ANOCOVA; it was a particular violation of the third assumption, which
Campbell and Erlebacher argued would apply to most evaluations of federal
education programs. The problem is that ANOCOVA will not correct for all the
possible causes of lower achievement in the disadvantaged group, particularly
when the pretest contains a portion of random error. This problem and others
are discussed later in this section.
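
The effect of random error in the pretest can be illustrated by simulation. The sketch below is hypothetical throughout: a "disadvantaged" group whose true level is one unit lower receives a treatment with no effect at all, and the covariate-adjusted difference is computed first with an error-laden pretest and then with an error-free one. It is not a reconstruction of Campbell and Erlebacher's own presentation.

```python
import random


def simulate_group(n, true_mean, error_sd, rng):
    """(pretest, posttest) pairs: a stable true level plus independent errors."""
    group = []
    for _ in range(n):
        ability = rng.gauss(true_mean, 1.0)
        pretest = ability + rng.gauss(0, error_sd)
        posttest = ability + rng.gauss(0, error_sd)  # no treatment effect
        group.append((pretest, posttest))
    return group


def adjusted_difference(treat, control):
    """Posttest difference adjusted by the pooled within-group pretest slope."""
    sxy = sxx = 0.0
    means = []
    for group in (treat, control):
        x_bar = sum(x for x, _ in group) / len(group)
        y_bar = sum(y for _, y in group) / len(group)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in group)
        sxx += sum((x - x_bar) ** 2 for x, _ in group)
        means.append((x_bar, y_bar))
    slope = sxy / sxx
    (tx, ty), (cx, cy) = means
    return (ty - cy) - slope * (tx - cx)


rng = random.Random(1)
with_error = adjusted_difference(simulate_group(20000, -1.0, 1.0, rng),
                                 simulate_group(20000, 0.0, 1.0, rng))
error_free = adjusted_difference(simulate_group(20000, -1.0, 0.0, rng),
                                 simulate_group(20000, 0.0, 0.0, rng))
print(round(with_error, 2), round(error_free, 2))
```

Under these assumptions the error-laden pretest removes only about half of the pre-existing difference, so a treatment with no effect appears harmful; with an error-free pretest the adjusted difference is zero.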
Figure 5. Algebraic description of ANOCOVA

[The equations of Figure 5 are not legible in this copy; only the following
text of the figure survives.]

The residuals represent the error of measurement remaining after the effects
of the covariates have been accounted for. If the treatment has no effect, then
they should be approximately the same size as the previous residuals
calculated. If the treatment is effective, these residuals should be much
larger than those previously calculated.

The ANOVA test statistic [formula not legible] is compared to tables of the
F-distribution, with degrees of freedom whose first term is m - 1 [second term
not legible].

If the obtained statistic is larger than the table entry for, say, the .05
level of significance, the conclusion is that there is at least a 95%
probability that the groups differ because of the treatment.

In a more recent evaluation, the Compensatory Reading Study (Trismen
et al., 1975), ANOCOVA was used where the covariates for predicting the
posttest score were (1) the pretest score and (2) the square of the pretest
score. This means that the estimates can be curved (quadratic) functions of the
pretest score (not any possible curve, but only simple concave or convex
curves). One reason for using the quadratic term is that the levels of pretest
scores of compensatory participants and others are different, and curvilinear
regression allows for the legitimate possibility of a different regression slope
(Assumption 2) between the two groups. See Figure 6 for a pictorial example
of such a case. The Compensatory Reading Study's analyses of covariance were
plagued with having to reject the analyses because of violations of Assumption
2 (equal regression slopes within different groups). Even with the quadratic
term, 44 of 160 critical tests of hypotheses in that study were uninterpretable
because of lack of homogeneity of regression slopes between the groups being
compared. Lack of homogeneity of regression slopes means that pretest and
posttest are more highly correlated in one group (in Figure 6, the compensatory
group) than in the other.

There are numerous explanations of differential slopes. Among them are
floor and ceiling effects, to be discussed below. The Compensatory Reading
Study made great efforts to avoid floor effects, but scatterplots indicated
some ceiling effects. Guessing can cause slopes of regressions to vary across
the range of pretest scores (i.e., will cause nonlinear regressions);
deviations of score distributions from normality will produce nonlinear
regressions; and differential growth rates can produce nonlinear regressions. A
significant problem with the use of the quadratic term in the ANOCOVA by the
Compensatory Reading Study was lack of investigation of the causes of the
nonlinearity. A more careful analysis would be likely to suggest a particular
type of curve, rather than an arbitrary parabola, and it might even suggest a
transformation of the scores that would lead to linear, homogeneous regressions
(the Compensatory Reading Study analyzed raw scores, not normalized scores).

The technical summary of the Compensatory Reading Study (USOE, 1976)
includes several alternative analyses that produced varying results when
applied to the same data. Among them were gain score comparisons and
comparisons of relations to a national norm population. Although the results of
the residual gain score analysis carried out by Educational Testing Service
(referred to in Trismen et al., 1975, as analysis of covariance)
subtly favored the noncompensatory groups, the results of the other analyses,
carried out by USOE, slightly favored the compensatory reading groups. That
different results arose from these different analyses is not helpful for the
utility of the study. Ideally, the results should converge to the same
conclusion, so the audience could feel confident that the conclusion was
independent of the analytical method.

A third use of ANOCOVA in compensatory education evaluation is imminent.
The U.S. Office of Education has undertaken to provide technical assistance
to state and local education agencies in their efforts to carry out
evaluations. As a vehicle for this technical assistance, RMC has developed
several evaluation models (Horst, Tallmadge, and Wood, 1975), some of which
involve ANOCOVA. "Model C" in that framework involves the use of ANOCOVA for
a particular type of nonequivalent treatment and control group. The essential
concept of that model is shown in Figure 7. The procedure is to give a pretest
and to select for compensatory treatment only those students who fall below
some criterion level. Then, after treatment and posttests are complete, the
procedure is (1) to calculate the relationship between pretest and posttest
based on the control group, (2) to extrapolate this relationship to predict
the treatment group's posttest scores, and (3) to test whether the treatment
group's scores are significantly different from (hopefully above) their
predicted level.
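
The three steps of the Model C procedure can be sketched in code. This is a minimal illustration under stated assumptions, not RMC's published procedure: the function names `fit_line` and `model_c`, the cutoff of 40, and the data are all hypothetical, and the t-like statistic ignores the uncertainty of the extrapolated line, so it will overstate significance.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def model_c(pre, post, cutoff):
    """Hypothetical sketch of the three steps described in the text."""
    # Students below the pretest cutoff received the treatment.
    control = [(x, y) for x, y in zip(pre, post) if x >= cutoff]
    treated = [(x, y) for x, y in zip(pre, post) if x < cutoff]
    # (1) Pretest-posttest relationship estimated from the control group only.
    slope, intercept = fit_line(*zip(*control))
    # (2) Extrapolate that line to predict the treated students' posttests.
    residuals = [y - (intercept + slope * x) for x, y in treated]
    # (3) Compare observed with predicted scores (naive one-sample statistic).
    effect = mean(residuals)
    t_like = effect / (stdev(residuals) / sqrt(len(residuals)))
    return effect, t_like


# Made-up data: controls lie exactly on a line; treated students score
# 5 points above the extrapolated line, plus or minus 1 point.
pre = list(range(20, 80))
post = []
for x in pre:
    y = 10 + 0.8 * x
    if x < 40:
        y += 5 + (1 if x % 2 == 0 else -1)
    post.append(y)

effect, t_like = model_c(pre, post, cutoff=40)
print(f"mean observed minus predicted: {effect:.2f}, t-like statistic: {t_like:.1f}")
```

Because step (2) is an extrapolation outside the control group's pretest range, everything in the sketch rests on the control-group line remaining valid below the cutoff.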

This model, discussed in abstract terms by Kenny (1975) and in more detail
by Rubin (1977), cleverly avoids criticisms leveled at other ANOCOVA models
in that it does not allow groups to differ in any systematic way not perfectly
measured by the pretest. This is accomplished by allowing the teacher no
freedom to introduce any other factor besides the pretest score into the
determination of who is in the treatment and control groups. Of course, that
means that if a teacher used his/her judgment during assignment of students to
the compensatory education class, "knowing" that a student could perform better
than his/her score indicated or that a student happened to make lucky guesses
on the pretest, the results using Model C would be distorted. The cleverness
of the model may also be a weakness in another sense: more than any other
variant of ANOCOVA, it depends on the assumption that the two-dimensional
scatter of pretest and posttest scores conforms to a (bivariate) normal
distribution. Although it is straightforward to transform pretest scores and
posttest scores separately to a normal distribution (see Issue 7), that does
not ensure that the two-dimensional scatter will be a bivariate normal
distribution or that the regression will be linear.

Another problem with this solution is that it fails to address the
question of whether the two groups (compensatory and regular instruction) are
really from the same population: it assumes they are, but Campbell and
Erlebacher (1970) have argued that they may be different. If you select only
according to a pretest, it still may be that you are separating populations
that have different achievement expectations. Because the solution appears
to be gaining a significant degree of popularity, we digress to describe an
example in which selection on the basis of a "pretest" would obviously
separate according to populations and would therefore lead to distorted
conclusions. Suppose that there were a classroom with 10 English-speaking
(Anglo) fifth-graders and 5 non-English-speaking Mexican-American fifth-graders
and a third of the class were assigned to a remedial reading program on the
basis of an English vocabulary test. With high probability, the
Mexican-American children would be given the treatment, and no amount of
statistical equating would remove the population effects on a reading posttest;
it is just not meaningful to extrapolate from the results of a comparison group
to the expected results for a different population. The point is that selecting
purely on the basis of a pretest does not ensure that the treatment and
comparison groups are alike except for pretest scores.

In order to understand broadly the controversy over ANOCOVA, we need to
examine some types of effects that lead to violation of the assumptions of the
method. Campbell and Boruch (1975) have discussed six such problems that are
well known at present. More problems and variants of the problems and new
problems with new variants of ANOCOVA are to be expected. The six problems
discussed by Campbell and Boruch are:

1. underadjustment of pre-existing differences; 

2. differential growth rates; 

3. increases in reliability with age; 

4. lower reliability in the more disadvantaged group;

5. test floor and ceiling effects; and 

6. grouping feedback effects. 

Each of these problems will be dealt with here briefly.

Underadjustment of pre-existing differences violates the third ANOCOVA
assumption in that differences remain after the effects of the covariates are
partialed out. These underadjustments arise from any systematic rules that
lead to assignment to groups other than by a single perfectly reliable measure.
The underadjustment arises from the "regression-to-the-mean" artifact in
estimating posttest scores in ANOCOVA. Whenever regression is used to estimate
scores and the covariate has a random error component, the observed regression
line will be less steep than the slope of the underlying relationship (see
Figure 8). For example, suppose $X = T + E_1$ and $Y = T + E_2$, where $E_1$
and $E_2$ are random error components. Since, except for random error, X and Y
are both equal to T, the "true" relationship would logically be Y = X. However,
if the variance of the errors is, say, 10% of the variance of T, then the
observed relation will be Y = .909X. That is not an error of the regression
method but rather a theoretical limitation of measurement.
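
The figure of .909 follows from the usual expression for a least-squares slope. The derivation below is a sketch that writes the pretest as $X = T + E_1$ and the posttest as $Y = T + E_2$ and adds the assumption, not stated in the text, that the two error components are uncorrelated with the true score $T$ and with each other:

```latex
\[
b_{YX} \;=\; \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
       \;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E_1)}
       \;=\; \frac{1}{1 + 0.10} \;\approx\; .909
\]
```

The numerator reduces to $\operatorname{Var}(T)$ because only the true score is shared by the two measures, while the error in the covariate inflates the denominator.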

If there are some population differences between those students selected
for treatment and controls, such as teachers' judgments of aptitude, that are
measured by the pretest but with some small random error, and if that difference
has any effect at all on posttest scores that is not reflected in the pretest,
the ANOCOVA test statistic will tend to indicate the posttests of the two groups
are farther apart than they really are, because ANOCOVA assumes that except for
the pretest the groups are completely equivalent. A solution to this problem
has been proposed by Lord (1960), Porter (1967), and Porter and Chibucos (1974)
and discussed and extended by Campbell and Boruch. The solution involves
measuring the reliability of measures used as covariates and then increasing
the regression coefficients to correct for the error in the covariate. In our
example above, knowing that the variance of errors is 10% of the variance of T,
or that the reliability of X is

$$\frac{\text{variance of } T}{\text{variance of } T + \text{variance of } E} = .909,$$

we would divide our observed regression coefficient by the reliability to obtain
a hypothesized relation of Y = 1.00X, which is the true relation. This
correction, referred to as "true score analysis," was investigated by Marston
and Borich (1977), who found that it tended in some cases to produce too many
statistically significant results. St. Pierre and Ladner (1977) investigated
the effect of this correction on the results of the Follow-Through evaluation
and found that the results did in fact change when the correction was made,
so one cannot rely on the easy reply that "it doesn't make much difference
anyway."
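
A small simulation can make the attenuation and its correction concrete. This is only a sketch under assumptions that go beyond the text: normally distributed true scores and errors, errors independent of the true score and of each other, and a reliability that is known rather than estimated. The helper `ols_slope`, the seed, and the sample size are illustrative choices.

```python
import random
from math import sqrt


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


random.seed(0)           # fixed seed so the sketch is repeatable
n = 20000
error_sd = sqrt(0.10)    # error variance is 10% of the true-score variance

true_scores = [random.gauss(0.0, 1.0) for _ in range(n)]
pretest = [t + random.gauss(0.0, error_sd) for t in true_scores]
posttest = [t + random.gauss(0.0, error_sd) for t in true_scores]

observed_slope = ols_slope(pretest, posttest)     # attenuated, near .909
reliability = 1.0 / (1.0 + 0.10)                  # var(T) / (var(T) + var(E))
corrected_slope = observed_slope / reliability    # near 1.00

print(f"observed slope  {observed_slope:.3f}")
print(f"corrected slope {corrected_slope:.3f}")
```

In practice the reliability must itself be estimated, and an error in that estimate propagates directly into the corrected coefficient.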

Differential growth rates are well known to occur in education. One need
only look at test publishers' growth scale curves to see that (1) younger
children learn faster (e.g., the overlap in scores between first and second
graders is less than the overlap between fifth and sixth graders) and
(2) children at the lowest percentile levels learn slower than other children.
Thus, equating groups on a pretest, whether it is done by matching, by gain
score analysis, or by ANOCOVA, will not necessarily equate them on expected
growth rate, so the treatment with the fastest learners will be the one that
appears most successful. Kenny (1975) has proposed that if one can collect data
on expected differential growth rates, use of those data in a standardized gain
score analysis would be appropriate.

Increase in reliability with age, which results from the attribute of
standardized tests that they tap more true score variance and less random
error among older students, has the effect of making scores that are equally
far apart on pretest and posttest appear to be more reliably (statistically
significantly) different at the time of posttest. Campbell and Boruch point
out the need for a model of reliability change so that analyses will be able
to correct for this artifact, and they propose such a model, but they note
that their "model is still very primitive and oversimplified."

Lower reliability in the disadvantaged group is another way in which
Campbell and Boruch suggest that equal true score gains can result in greater
observed score gains for one group than for another. The gains, although
equal for the two groups, will be less statistically significant for the
disadvantaged group.

Floor and ceiling effects can be quite serious, because it is nearly
impossible to correct for them after they occur. If a large percentage of
students achieved a perfect score on a posttest, it is certain that their
gains would be underestimated, but by how much is unknown. Furthermore, for
ANOCOVA, the slope of the regression curve of posttest as a function of
pretest among the students at the ceiling will be nearly horizontal, because
no differences on posttest will be observed for these students although there
may be differences at pretest. Therefore, extrapolating linearly to the
students of lower ability would put the lower ability students at a
disadvantage.

In case of floor effects, the result of testing will be that gains
are underestimated for individuals with pretest levels of achievement much
lower than the level that is needed to barely exceed chance performance.
Some students will even exhibit "negative learning" because of "lucky" guesses
on the pretest. Thus, treatments that are applied to individuals at ability
levels lower than those for which the achievement pretest is designed will be
much less likely to show systematic gains from pretest to posttest than
treatments applied to students in the midrange for the test (see Figure 9).

How might one detect, and correct for, floor effects? Detection is
fairly simple. If there are any scores below the chance level, then some
floor effects are probably present. Some students may not guess, however, so
their scores even though below chance level would not be at the test floor;
thus, control of guessing (e.g., encouraging it) is important and, more
important, scores should be corrected for guessing, taking into account the
number of items attempted, in order to identify floor effects. Correction for
floor effects is more difficult, so difficult that the use of "out-of-level"
tests specifically to avoid floor effects, such as used in the Compensatory
Reading Study (Trismen et al., 1975), is recommended. The problem with choosing
a lower test level to fit the achievement range of compensatory education
participants is that regular students are likely to score at the ceiling of
that test, and comparison using two different tests would rely too heavily on
the test publisher's articulation between the levels.
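
The text does not say which correction for guessing is intended. One common choice, assumed here purely for illustration, is the conventional formula score: the number right minus the number wrong divided by one less than the number of answer options, with omitted items counting neither way. The function names and the example scores below are hypothetical.

```python
def corrected_score(right, wrong, options):
    """Conventional formula score; omitted items are simply not counted.

    A student who guesses blindly on every attempted item is expected to
    score zero after this correction, whatever the number attempted.
    """
    return right - wrong / (options - 1)


def may_be_at_floor(right, wrong, options):
    """Flag scores no better than what blind guessing would produce."""
    return corrected_score(right, wrong, options) <= 0


# Two hypothetical students on a 40-item, four-option test.
guesser = corrected_score(right=10, wrong=30, options=4)   # chance-level raw score
reader = corrected_score(right=25, wrong=15, options=4)

print(guesser, may_be_at_floor(10, 30, 4))
print(reader, may_be_at_floor(25, 15, 4))
```

Because the correction uses the number of items actually attempted, a student who omits many items and a student who guesses on all of them are no longer confused merely because their raw scores are equally low.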

The issue of ceiling effects is somewhat different from floor effects
for two reasons. First, the ceiling effects occur in the comparison group,
not the treatment group, in compensatory reading programs; and second, ceiling
effects are more clearly observable, since the scores are not contaminated by
guessing behavior. The first difference is important because the comparison
group is taken as the standard against which to compare the treatment, and
that means that model parameters estimated for the comparison group (as in
RMC's Model C) will be greatly affected by the ceiling effect. These parameters
are the average amount of growth in achievement, the variance of growth scores,
and the correlations between pretest and posttest scores. The ceiling effects
will lead to underestimation for the comparison group of average gains,
variance of posttest scores, and correlations between pretest and posttest
scores. These results of ceiling effects will cause linear extrapolation of the
relation between pretest and posttest scores from the comparison group to the
range of pretest scores obtained by the treatment group to produce larger
expected gains (i.e., a more difficult criterion) than if the ceiling effect
were not operating. To deal with this potential problem, the Compensatory
Reading Study used a quadratic extrapolation, which has not been well
investigated, but is likely to correct (or partially correct or overcorrect)
for the ceiling effect.

The detection of ceiling effects is easy: are there any perfect scores?
It would be reasonable to correct for ceiling effects by transforming perfect
scores upward in order to produce a symmetric distribution, or alternatively,
to delete from the comparison group used in the study all students achieving
a pretest score higher than the lowest pretest score of a student achieving
a perfect score on the posttest. This latter procedure could be slightly
refined to account for the possibility of achieving a perfect score by
guessing at one or more items. In general, such corrections are more reasonable
for ceiling effects than floor effects, because the role of guessing is so
much less at the top of the test scale; the higher a student's score, the
less will guessing be a contributing factor to that score.
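
The second of these corrections, deleting comparison students by pretest score, is simple enough to sketch. The function name `trim_ceiling`, the maximum score of 40, and the five score pairs are hypothetical; the rule itself follows the wording above literally, so a student whose pretest equals the threshold is retained.

```python
def trim_ceiling(comparison, max_score):
    """Drop comparison students whose pretest exceeds the lowest pretest
    of any student with a perfect posttest.

    comparison: list of (pretest, posttest) pairs.
    """
    perfect_pretests = [pre for pre, post in comparison if post == max_score]
    if not perfect_pretests:
        return list(comparison)      # no perfect scores, nothing to trim
    threshold = min(perfect_pretests)
    return [(pre, post) for pre, post in comparison if pre <= threshold]


# Made-up comparison group on a test with a maximum score of 40.
group = [(10, 20), (15, 30), (18, 40), (20, 38), (25, 40)]
trimmed = trim_ceiling(group, max_score=40)
print(trimmed)
```

The student with a pretest of 20 is dropped even though that student's posttest was not perfect, because students starting that high were at risk of reaching the ceiling.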

The problems of ceiling and floor effects we have considered pertain
particularly to the case of treatment and comparison groups with unequal
ability levels. When both groups suffer from identical floor (or ceiling)
effects, the problems dissolve into the simple problem of overall lack of
sensitivity, which can be avoided by choosing a different test or test level.

Finally, there is a substantive problem of grouping feedback effects.
This is the set of effects due to different sets of peer interaction. When
compensatory education participants are in a separate environment, they
provide an environment for each other that is different from the environment in
the regular classroom. This effect cannot be "partialed out" to observe the
true instructional treatment, because in a real sense the selection process
is part of the total treatment.

In summary, the purpose of methods for comparing nonequivalent treatment
and comparison groups is to make them as similar as possible so that
differences in outcome can be attributed to the treatment. The weakness of the
methods that is most likely to destroy the credibility of conclusions derived
from such comparisons is the finding of important pretreatment differences
between the groups (or even the argument that there may have been such
differences) that were not taken into account in the analyses. Therefore, two
important recommendations can be made.

First, the groups should be selected in order to be as similar as possible,
maximizing the overlap of similar members. In cases where this is prohibited,
as in RMC's Model C, the assumption of the analyses that the treatment and
control groups learn according to the same patterns and principles is highly
questionable unless, as Rubin (1977) points out, the evaluator is reasonably
certain on the basis of prior knowledge that those patterns are the same. In
attempting to match groups, some caution is necessary, however. If matching
is achieved partially because of unreliable chance variation (e.g., when
matching on a pretest of less than, say, 95% reliability) so that the match
would not persist throughout the evaluation, then differential regression to
the mean will confound the analyses. Therefore, matching should be made on the
basis of reliable measures.


Second, various sources of difference between treatment and comparison
groups should be explicitly noted in planning and reporting the study, and
measurement of all potential differences and use of those measurements in
analyses should be undertaken.

Given that these recommendations are followed, then the use of analysis
of covariance, followed by subsidiary analyses to evaluate the distortion in
results due to the nonequivalence of the groups, seems appropriate, if
randomized assignment is ruled out. Because of the controversy concerning the
correction for unreliability of the covariates, that procedure appears
questionable at present: it should be used only, as by St. Pierre and Ladner
(1977), in conjunction with uncorrected analyses to determine the possible
effects of unreliability of the covariate on the results. Improving the
reliability of the covariates is preferable; one possibility in the
educational evaluation area might be to use the gain (posttest minus pretest)
as the dependent variable and the sum of the posttest and pretest scores as a
more reliable covariate. This would sacrifice reliability in the dependent
variable, which implies merely a loss of precision in results, in order to
gain reliability in the covariate, which reduces the bias in the results. The
greater reliability of the covariate derives from its being the sum of two
measurements of the same construct, and more information corresponds to
greater reliability.* One might worry that this will confound the analyses
because the pretest and posttest are both used in calculating the covariate;
however, if gain is the true variable of interest, then that does not matter:
knowing the sum of the pretest and posttest scores tells one absolutely nothing
about the amount of gain between them (unless floor or ceiling effects are
noticeable).
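
The relation between the gain and the sum can be checked with a line of algebra. In the sketch below, $X$ denotes the pretest and $Y$ the posttest; these symbols, and the reading of "tells one nothing" as "is uncorrelated with," are assumptions made for the illustration.

```latex
\[
\operatorname{Cov}(Y - X,\; Y + X)
  \;=\; \operatorname{Var}(Y) + \operatorname{Cov}(Y, X)
        - \operatorname{Cov}(X, Y) - \operatorname{Var}(X)
  \;=\; \operatorname{Var}(Y) - \operatorname{Var}(X)
\]
```

On this reading the gain and the sum are uncorrelated when the pretest and posttest have equal variances, which is one reason floor or ceiling effects, which compress the variance at one testing, are singled out as the exception.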

The subsidiary analyses one should plan to carry out when using ANOCOVA
on nonequivalent groups include at least: (1) estimation of the reliability
of the covariates; (2) demonstration that, on one or more measures not expected
to be directly affected by the treatment, partialing out the effects of the
covariates does in fact eliminate group differences; (3) testing the functional
form of the regression equation by fits to scatter diagrams, both visually and
statistically; and (4) whenever alternative explanations of results appear
plausible, performing the analyses in different ways in order to demonstrate
the range of possible conclusions one could reach based on the data. These
types of analyses have not customarily been carried out, primarily because
they were not planned for; when they have been carried out, they have added
substantially to the credibility of evaluation results. Therefore, it seems
important to include plans for such analyses in future evaluation studies.

Issue 9. Under what conditions can one infer relationships of Title I
costs and treatments to effectiveness?

Information on the selection of services that maximize the benefits
to be derived from various levels of Title I expenditure is, in the long
run, the most important information that evaluations can provide. In
order to gather that information with adequate validity to provide the
basis for widespread selection of treatments, carefully controlled
comparisons involving true experimental designs are called for. Correlational
data gathered from ongoing projects are subject to great distortion,
but these are the data most readily available. The discussion of this
issue will point out four kinds of difficulty in making inferences about
treatment-effectiveness and cost-effectiveness relationships and will
suggest ways of dealing with the difficulties.

The four types of difficulty are (1) in identifying the contributions
of Title I, (2) in comparing treatments with different objectives, (3) in
identifying what relationship one should study, and (4) in making causal
inferences from correlational data. Each of these difficulties has played
a role in the design and outcome of Title I evaluations.

The first difficulty, identifying Title I contributions, has two
sources: the multiplicity of programs designed to meet objectives
similar to the objectives of Title I and the unintended side effects of
Title I funds. The first problem is due to the plethora of educational
programs at the state and federal levels with overlapping goals. While
one can usually identify compensatory education services fairly readily
from onsite observation, tracking down what components are paid for by
Title I can be well-nigh impossible. Moreover, in most if not all cases,
Title I pays only a small portion of the total cost of educating any
student, so achievement gains can only tenuously be related to Title I
services without careful process analysis. The diversity of sources for
educational funds is shown in the surveys by the National Center for
Educational Statistics (NCES, 1976, 1976). During the 1971-72 school
year, at least eight different federal programs provided funds for reading
instruction, with 92% coming from Title I, and during the 1972-73
school year, there were at least ten programs, with 85% coming from Title
I. Thus, even though reading instruction is the subject matter most
closely related to Title I among federal education programs, other federal
programs as well as state and local programs supported significant reading
instruction. A report on compensatory education in California in the 1974-75
school year (California State Department of Education, 1976) covered
three state programs as well as Title I and found that there were more
individual compensatory reading programs at each grade level with Title I plus
other sources of funds than with Title I funds alone. Although California
is hardly typical, a quote from the summary of that report will give an idea
of the complexity of divisions of funds from various sources and various
services:

    In ECE [State Early Childhood Education] programs, ... of
    funds went to pay classified salaries, and 21% ... for
    certificated salaries. In ESEA Title I programs, 43% of funds were
    used for classified salaries and 33% for certificated salaries.
    In EDY programs [Education for Disadvantaged Youth], ... of
    funds went to pay classified salaries, while 71% ... for
    certificated salaries (page 60).

Did EDY programs pay for teachers, and other programs for support personnel?
Is there an accounting procedure that makes it simpler for local districts
to assign some funds to some services and other funds to other services?

Because of the myriad sources of funds for most of the school districts
that receive Title I funds, it is in fact infeasible to obtain
estimates of the Title I effects at any reasonable cost, that is, if what
is required is an estimate across the nation. The mere fact of the
continued existence of Title I and its ramifications in terms of effects on
the development of state compensatory education programs and other
compensatory education programs makes it impossible at this point, even in
theory, to estimate the total Title I effect in most school districts. On
the other hand, it may be possible by an intense, in-depth analysis of
the budgets and services and impact of Title I within a small number of
school districts to estimate what actually was the direct Title I effect.
Where Title I contributions are inextricably mixed with the funds from other
sources, proportional allocation of the "credit" for benefits would be
possible. This is one area in which care must be taken not to allow the
need for information to interfere with optimal use of Title I funds, except
possibly for a negligible distortion in a few districts randomly
selected for special study.

Of more interest than isolation of Title I contributions may be
examination of the effects of expenditure variations on whether
compensatory education programs of any type work. In fact, for the fundamental
purpose of program evaluation, planning for the future, it is not as
important to find out what the Title I contribution has been as to find
out how to direct Title I expenditures to increase the effectiveness of
other projects in the future, that is, to perform a cost-effectiveness
analysis.

Many side effects of Title I funding can be imagined, such as
increasing the number of jobs for reading aides in impoverished
communities. For the purposes of evaluation in terms of children's
achievement, however, side effects on children in school are most relevant.
The most salient side effect is likely to be enhancement of the scholastic
processes for noncompensatory students; providing special resources
for educationally disadvantaged children will in most cases reduce the
demands of these children on the regular instructional resources (e.g.,
teachers' time), allowing greater resources to be devoted to the
noncompensatory students. Thus, comparisons between compensatory and regular
treatments are less likely to show the benefits of compensatory education
than comparisons between matched schools or classrooms in which the
"comparison" group has the same membership it would have had if Title I funds
were not available (i.e., including educationally disadvantaged children).

Other relevant side effects to be measured in a careful evaluation
include (1) filtering of effective compensatory reading methods into the
regular curriculum, (2) possible stigma associated with participation
in a compensatory treatment, and (3) possible reduction in the
effectiveness of regular instruction due to allocation of too much of the
available teaching expertise to the teaching of educationally disadvantaged
children. The assessment of these and other side effects requires astute
onsite observation of the processes occurring during the treatment period.
Survey data will almost surely be inadequate.

The second difficulty concerns the multiplicity of objectives of
Title I projects. The Elementary and Secondary Education Act of 1965 was
intended to provide services in the schools that would equalize the opportunity
of children from low-income areas to enjoy a fulfilling education. Many
different uses of the money allocated to local districts were attempted.
Gradually a few distinctive types of service emerged as most appropriate for
Title I expenditures. Table 5 shows a breakdown of expenditures taken from
the NCES survey of the 1972-73 school year. Clearly, reading and mathematics
have become central. One might envision a future in which the Title I
program is divided into subprograms of math instruction and reading
instruction; however, there are advantages to comparing the different services
within a single framework as well as advantages to analyzing them separately.

One reason for making comparisons across different services is to
determine which types of service have broader impact. A service which would
result in a child's improvement in several scholastic areas would have
apparently greater utility than a service that merely improved performance
in a single area. One might guess, for example, that compensatory reading
instruction would have broader impact than compensatory social studies
instruction; and if they have impact at all, food, health, and counseling
services may have the broadest impact. To compare different services, it
would seem necessary to determine a vector of criteria for achievement and
other potential outcomes and to measure gains from a particular type of
service on all these criteria. Thus, one could operationalize the guess that
reading is broader than social studies by predicting larger combined total
gains in reading, social studies, and mathematics as a result of reading
instruction than as a result of social studies instruction.

There are other reasons for comparing different services in the same
framework: studies of principles of successful programs in one service area
may yield insights into successful methods for other services; critical
prerequisites, such as grade level, maturity, and other basic skill
achievement, may determine when a particular compensatory instruction is best
conducted; and there may be mutually facilitative or inhibitory effects of
simultaneous receipt of two or more different Title I treatments. Clearly,
analysis of services aimed at different objectives is worthy of study.

On the other hand, it is quite reasonable for a national evaluation with
limited resources to focus on a single type of service, as the Compensatory
Reading Study did, rather than to compare mixtures of different services.
Data on compensatory reading classes alone are much more likely to yield
results of defensible validity than data on 100 compensatory reading classes,
50 compensatory mathematics classes, and 50 other compensatory treatments.

Table 5

Percentage Expenditures for the Title I Low-Income Area
Support Program During the 1972-73 School Year

Direct Services
    Reading (English)                     38%
    Other English Language Arts            6%
    Mathematics (and Natural Science)     11%
    Other Basic Skills                    11%
    Other                                  1%

Support Services                          31%
    Pupil Services                        10%
    Fixed Charges                          8%
    Other                                 13%

Other

Source: NCES (1976)

The decision of whether treatments with qualitatively different
objectives should be included in the same study depends on the particular
information needs being satisfied. If a general description of the program
is needed, then it seems appropriate to include all treatments, but if
information on the effective methods for compensatory education is sought,
comparisons should be made only between treatments that have common
objectives.

The third difficulty is in the identification of the relationship to be
studied. Although this may seem obvious, it is not. The important
information may not be merely that when variable A is increased, so will
variable B be increased. As a practical example, a controversy around 1972
concerned whether there was a "critical mass" of Title I funds that needed to
be spent on each participant (e.g., $100 per year or $300 per year) in order to
have an impact on his/her achievement. The implications of this issue for
policies of concentrating funds on a few of the most disadvantaged children
are clear. Although a great deal of secondary resource availability
information is required in order to address this question properly, it can be
approximated by examination of the relationship of per pupil expenditures
and achievement across districts.

In order to answer the "critical mass" question, it is necessary to
determine whether there is some value of per pupil expenditure such that
expenditures above that level have a far greater effectiveness than
expenditures below that level, that is, to determine the point of maximum
increase in effectiveness plotted as a function of expenditure, as in
Figure 10. Tallmadge (1973) merely examined the correlation between
expenditures and effectiveness to deal with the critical mass question. Of
course, his finding of almost no correlation suggests that other analyses
would not turn up a critical mass, so the other analyses may not have been
warranted for his data.
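
Locating the point of maximum increase is a different calculation from computing a correlation. The sketch below is one crude way to do it, assuming district-level pairs of per pupil expenditure and a single effectiveness score; the function name `steepest_interval` and every number in the example are hypothetical and are not taken from any Title I study.

```python
def steepest_interval(expenditure, effectiveness):
    """Return the pair of adjacent expenditure levels between which
    effectiveness rises fastest per dollar.

    Assumes distinct expenditure levels; with real data one would first
    average or smooth effectiveness within expenditure bands.
    """
    pairs = sorted(zip(expenditure, effectiveness))
    best = None
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        slope = (y1 - y0) / (x1 - x0)
        if best is None or slope > best[0]:
            best = (slope, x0, x1)
    return best[1], best[2]


# Made-up values shaped like a threshold curve.
dollars_per_pupil = [50, 100, 150, 200, 250, 300]
gain = [1.0, 1.2, 1.5, 4.0, 4.3, 4.4]

critical_range = steepest_interval(dollars_per_pupil, gain)
print(critical_range)
```

A threshold-shaped curve of this kind can coexist with a modest overall correlation, which is why the two analyses address different questions.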

Another more general question about relational definition, mentioned in
the discussion of Issue 1, concerns whether achievement gains are to be
treated as equally important across the scale or whether gains which result
in students' surpassing a specified proficiency level are to be treated as
most important. If a particular method of compensatory instruction focuses
on achieving a particular level of achievement for all participants, its
production of group average achievement gains is likely to be less than a
method that treats all children's gains as equally important, whether they
are moderately or severely disadvantaged.

Consider a concrete example. Suppose in a compensatory class there
were four students with different learning rates. They required,
respectively, 10 hours, 20 hours, 30 hours, and 40 hours to learn a
particular amount, say M. Suppose one teacher allots 100 hours as follows:
10 hours to the first student, 20 to the second, 30 to the third, and 40 to
the slowest student. Each student would then learn the amount M. Suppose a
second teacher allotted 25 hours to each student. The fastest student would
learn an amount equal to 2.5 M, the second student 1.25 M, the third student
.833 M, and the slowest student just 25/40, or .625 M. The average gain
under this teacher would be

$$\frac{2.5 + 1.25 + .833 + .625}{4}$$

or about 1.3 M, substantially greater than under the more flexible teacher.
The point of this example is that focusing on compensatory class averages
instead of, say, class minima has significant implications for the type of
process that will be found to be most effective.
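
The arithmetic of this example can be checked with a short sketch. It
assumes, as the example itself implicitly does, that the amount learned is
proportional to the time allotted; the function and variable names are
illustrative only.

```python
# Hours each of the four students requires to learn the amount M.
hours_needed = [10, 20, 30, 40]


def amounts_learned(hours_allotted):
    """Amount learned by each student in units of M (assumes linear learning)."""
    return [allotted / needed
            for allotted, needed in zip(hours_allotted, hours_needed)]


flexible = amounts_learned([10, 20, 30, 40])  # first teacher: time matched to need
uniform = amounts_learned([25, 25, 25, 25])   # second teacher: equal time for all

for label, learned in (("flexible", flexible), ("uniform", uniform)):
    average = sum(learned) / len(learned)
    print(f"{label}: mean = {average:.3f} M, minimum = {min(learned):.3f} M")
# flexible: mean = 1.000 M, minimum = 1.000 M
# uniform: mean = 1.302 M, minimum = 0.625 M
```

The uniform allotment wins on the class average, while the flexible
allotment wins on the class minimum, so the choice of summary statistic
determines which teacher appears more effective.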

The identification of relations to be assessed in an evaluation depends
on (1) clear knowledge about the information needed and the uses to which it
is to be put and (2) expertise in translation of verbally stated relations
into quantitative calculations.

The fourth difficulty concerns the inference of causal relations from
correlational data. If the correlation of a particular instructional process
with achievement, across a variety of settings, is positive, then the
initial reaction is that the process is effective. There are many other
possible explanations of the correlation, however: other events that may
have caused both the process to occur and achievement to be high. For
example, the process may have been employed in districts containing large
numbers of students who would be likely to make higher than average
achievement gains, or the occurrence of the process could be merely an
indicator of teacher expertise or some other underlying factor that, through
other processes, caused achievement to rise.

The solution for making inferences from correlational data is to have a
prior, detailed model of the instructional system being observed that
includes a chain of related events that lead from processes to effectiveness
measures. Some of the events in the chain can then be monitored as well as
the occurrence of the process of interest, and finding the predicted chain
of correlational results that would explain the correlation of service with
effectiveness would rule out most alternative explanations for the
correlation. The necessity for a detailed system process model for valid
interpretation of correlational data cannot be overemphasized. Without such
a model, one should be highly skeptical of all correlational results of
compensatory educational evaluations.

In summary, it is our opinion that inferences concerning relations of
costs and treatments to effectiveness can be made from surveys and
correlational results, but only if a great deal of care and preparation
precede such inferences. Inferences from true experimental designs are much
more credible, if such designs are feasible. Concerning the other three
difficulties discussed above, (1) the isolation of Title I contributions can
be very difficult, and for many information needs is not as necessary as
identification of compensatory education treatments supported by whatever
funding sources; (2) direct comparison of treatments with qualitatively
different objectives is questionable and rarely necessary, although joint
study of treatments with different objectives may provide useful results
concerning the generality of processes affected by the treatments; and
(3) substantially more consideration should be given to the identification
of just what relations are to be assessed than has been the case in the
past.



ISSUE 10. How should data be aggregated across projects in Title I
evaluations?

The reason for aggregating data across projects is to provide an
assessment of the status of Title I throughout the state, region, or the
country. This kind of aggregation is clearly necessary for annual reports to
Congress and also for general management policy decisions. On the other
hand, there are important uses of the local evaluations that do not involve
aggregation beyond the district. These are uses, for example, to provide
feedback within the district as to what types of services are working and
how they are working. Thus, it is quite reasonable for a local district to
gather data and analyze, summarize, and report it in a manner that in fact
would not allow its being easily aggregated with data from other projects in
its state or in the country. In the past, it has been customary to attempt
to aggregate all of the local evaluation reports into state evaluation
reports, which were then aggregated into a national report to summarize the
impact of Title I projects across the country.

There are two aspects of this issue to be dealt with.

1. What are the appropriate units to aggregate across projects?

2. What is the appropriate system for weighting various projects
during aggregation?

Major national syntheses of Title I impact (Wargo et al., 1972; Gamel
et al., 1975; Thomas & Pelavin, 1976) have been built primarily on annual
state reports, and an effect of that has been that conclusions were based on
aggregations of grade-equivalent scores, those being the units most
frequently reported by the states. This type of national synthesis is a
particularly efficient form of national evaluation, because it involves no
new collection of data; however, the evaluator has no control over the
collection of these data, and as a result both the evaluator and his/her
audience have significant doubts as to the data's validity. In the long run,
as long as evaluations will be challenged, it is necessary to estimate a
minimum level of credibility below which the evaluation is useless and to
select an evaluation strategy to ensure that level of credibility.
Aggregations of reports generated for some other purpose, while quite useful
as corroborative evidence, are dubious as the primary information source. In
general, one can say that the collection of data from many projects and
their aggregation should follow from an examination of information needs,
and then data collection should be carried out in order to satisfy those
needs. Of particular importance is the fact that, while every district
receiving Title I aid should be carrying out evaluation for its own
purposes, the data needed for a national summary evaluation could be
supplied by a small random sample of the districts receiving Title I funds
as long as that sample is selected in an unbiased and representative manner.
Several studies (USOE, 1970; Glass, 1970; NCES, 1975, 1976; USOE, 1976) have
based national summaries on a sample of districts.

Let us consider in some detail the measurement units that should be
aggregated. Alternative units were discussed under Issue 7. In the past,
the rule has most frequently been to transform gains observed or scores
observed in particular projects or particular subjects into grade-equivalent
gains of month per month and to average these numbers across projects in a
state and then across states. Although we might argue about the use of
grade-equivalent scores, it is clearly necessary for aggregation that
comparable units be entered into the averages for each of the districts that
are being aggregated. Certainly raw post-test scores or raw gain scores
would not be appropriate for aggregation unless the same test were used
throughout the country. But on the other hand, in evaluation studies that do
use the same test in all schools, such as the Compensatory Reading Study
(Trismen et al., 1975), it is more reasonable to average the raw test
scores, although normalized standard scores would be preferable. When the
scores to be aggregated are from different levels of a particular test,
equating for the articulation between the levels must take place (e.g., by
use of growth scale scores).

The primary requirements for scores to be aggregatable are (1) that
they have the same meaning for all cases that are being aggregated and (2)
that the aggregate score have the same meaning for the aggregate group as
each score has for the case it represents. Thus, in order to aggregate
scores on different tests across projects, it is necessary to aggregate a
derived score that expresses the observed performance relative to some
expected or national norm performance. Four possibilities are percentile
gains, grade-equivalent gains, normalized standard score gains, or
percentages of students achieving specified objectives. If any one of these
scores is computed for each individual and then aggregated by an
appropriately weighted averaging, it will satisfy the second of the two
requirements, if it satisfies the first. Percentile, grade-equivalent, and
raw gains, however, usually do not have the same meaning for all cases
aggregated, if one assumes that normalized standard scores linearly
represent the underlying achievement dimension: a particular
grade-equivalent gain, obtained at the low end of the achievement scale,
implies a larger underlying gain than the same grade-equivalent gain
obtained at higher achievement levels, and a given percentile gain
represents a larger "real" gain at the extremes of the scale than in the
middle. The validity of the assumption for this argument was questioned and
discussed in Issue 7. Also, the summary of the Compensatory Reading Study
(USOE, 1976) includes an appendix that demonstrates that had that study
used grade-equivalent scores, the conclusions would have been seriously
distorted. The conclusion arrived at there was that grade-equivalent scores
"should never be used in educational evaluations" (page 77, emphasis in
original).

Gains in normalized standard scores or normal curve equivalents are
especially appropriate for aggregation, because adding them together, unlike
other alternatives, does not change their statistical properties: the
aggregate score is also normally distributed. Finally, percentages of
students achieving specified objectives must be properly weighted to be
meaningfully aggregated, and the proper weighting is equivalent to adding
numerators and denominators together separately to obtain an overall
percentage (e.g., 4 out of 5 in one project [80%] plus 5 out of 10 in
another project [50%] yields a total of 9 out of 15 [60%]).
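
The rule for combining percentages can be shown as a minimal sketch using
the two projects of the example above. The function names are illustrative
only; the second function is included solely to show the improperly weighted
alternative.

```python
def pooled_percentage(projects):
    """Properly weighted percentage: add numerators and denominators separately.

    projects: list of (students_achieving_objective, students_tested) pairs.
    """
    achieved = sum(a for a, _ in projects)
    tested = sum(n for _, n in projects)
    return 100 * achieved / tested


def unweighted_mean_percentage(projects):
    """Simple average of project percentages, ignoring project size."""
    return sum(100 * a / n for a, n in projects) / len(projects)


projects = [(4, 5), (5, 10)]                 # 80% and 50%
print(pooled_percentage(projects))           # 60.0
print(unweighted_mean_percentage(projects))  # 65.0
```

The unweighted average of 65% overstates the result because it gives the
five-student project the same influence as the ten-student project.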

One further note: it is usually not meaningful to transform aggregated
units of one type to another type of unit in order to perform further
analyses. For example, one might consider transforming the mean
grade-equivalent gains reported in annual state Title I evaluation reports
into mean normalized standard scores in order to aggregate across states.
Theoretically, one could use standard test publishers' tables to make the
transformation. However, this transformation would be meaningless, primarily
because of the nonlinearity of each derived score as a function of raw
scores. The mean of a group of percentile scores is not generally equal to
the percentile of the mean of their raw scores, and similarly for
grade-equivalents and normalized standard scores. Once one has selected a
particular measurement unit and performed one level of aggregation (e.g.,
calculated a mean), further analysis and aggregation must be in terms of
that unit in order to be valid.
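
The effect of nonlinearity can be seen in a small sketch. It compares
percentiles with normalized standard scores rather than with raw scores, and
it assumes an exactly normal score distribution; the two-student group is
hypothetical. The normal curve equivalent is taken to be the usual
normalized scale with mean 50 and standard deviation 21.06.

```python
from statistics import NormalDist, mean

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_from_z(z):
    """Percentile rank corresponding to a normalized standard score."""
    return 100 * STANDARD_NORMAL.cdf(z)


def z_from_percentile(p):
    """Normalized standard score corresponding to a percentile rank."""
    return STANDARD_NORMAL.inv_cdf(p / 100)


def nce_from_z(z):
    """Normal curve equivalent: a linear rescaling of the standard score."""
    return 50 + 21.06 * z


z_scores = [-2.0, 0.0]  # hypothetical group of two students
mean_z = mean(z_scores)
mean_percentile = mean([percentile_from_z(z) for z in z_scores])

# Transforming the aggregated percentile back does not recover the mean score.
recovered_z = z_from_percentile(mean_percentile)
print(mean_z, round(recovered_z, 2))

# A linear unit such as the NCE commutes with averaging.
mean_nce = mean([nce_from_z(z) for z in z_scores])
print(round(mean_nce, 2), round(nce_from_z(mean_z), 2))
```

The mean percentile of this group is about 26, which corresponds to a
standard score well above the group's true mean of -1.0, whereas the mean of
the NCE scores equals the NCE of the mean.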

Let us consider, now, the problem of weighting the results from various
projects in determining an aggregate summary value. Weighting is a method to
obtain representative unbiased estimates of population values even though
one has a sample with known biases. As mentioned in discussing the various
methods in the introduction to the Sampling Section, one can use stratified
sampling, sample with different sampling proportions from each of the strata
producing a biased sample, and then recombine the data using weights to
eliminate the bias. This was done, for example, in the CPIR surveys (NCES,
1975, 1976).

The reasons for sampling in different ratios from various strata are
(1) the need for equal precision of estimates in strata of different sizes,
(2) differences in the cost of collecting data from different strata, and
(3) effects of sampling units. If one stratum contains 200 schools and
another 800 schools, and if one is planning to use a sample of 50 schools
both primarily to test for differences between the two strata and
secondarily to provide an overall population estimate, then, other things
equal, he/she should select 25 schools from each stratum, not the 10 schools
in one stratum and 40 schools in the other stratum needed for
representativeness. The population estimate can still be obtained by
weighting the schools in the second stratum by four times as much as those
in the first stratum (each sampled school in the second stratum represents
800/25 = 32 schools in the population, whereas in the first stratum each
sampled school represents 200/25 = 8 schools in the population, and
32 = 4 x 8).
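
The allocation and the resulting weights in this example can be reproduced
with a brief sketch. The stratum labels and function names are illustrative
assumptions; the numbers are those of the example.

```python
def proportional_allocation(population_sizes, total_sample):
    """Sample size per stratum needed for a representative sample."""
    total = sum(population_sizes.values())
    return {s: total_sample * n / total for s, n in population_sizes.items()}


def expansion_weights(population_sizes, sample_sizes):
    """Number of population units represented by each sampled unit."""
    return {s: population_sizes[s] / sample_sizes[s] for s in population_sizes}


population = {"first": 200, "second": 800}

print(proportional_allocation(population, 50))
# {'first': 10.0, 'second': 40.0}

weights = expansion_weights(population, {"first": 25, "second": 25})
print(weights)
# {'first': 8.0, 'second': 32.0}
```

Multiplying each weight by the number of schools sampled in its stratum
returns the stratum sizes of 200 and 800, which is a quick check that the
weights are consistent with the population.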

Different selection ratios based on cost are most noticeable in the
follow-up of nonrespondents. Costs may be 10 or even 50 times as great per
case in the stratum of nonrespondents as in the stratum of respondents.
Thus, the benefit from finding all nonrespondents will rarely justify the
costs. Texts on sampling theory (e.g., Raj, 1968) provide formulas for
optimal tradeoffs of cost and precision as a function of one's needs for
precision.

The third reason for weighting is to reconstruct one population from
a sample from another population. For example, if mean achievement levels
are available from state reports, they can be used to produce national
estimates by weighting each state's achievement level by the number of
students in the state.

Briefly, to be explicit, weighting means multiplying each sampled
unit's score by the number of units in the population it represents when
calculating means, standard deviations, and so on. In the example of
differential sampling from two strata discussed above, if the mean number
of students in schools in the first stratum is 150 and for schools in the
second stratum it is 300, then the unbiased estimate for the mean for the
population of 1,000 schools is not (150 + 300)/2 = 225 but rather

$$\frac{200(150) + 800(300)}{200 + 800} = 270.$$
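
The weighted estimate can be computed directly, as in the following
sketch. The function name is illustrative; the stratum means and sizes are
those of the example, with stratum sizes serving as the weights.

```python
def weighted_mean(values, weights):
    """Mean of values, each multiplied by the number of units it represents."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


stratum_means = [150, 300]  # mean students per school in each stratum
stratum_sizes = [200, 800]  # schools in each stratum of the population

print(sum(stratum_means) / len(stratum_means))      # 225.0, biased
print(weighted_mean(stratum_means, stratum_sizes))  # 270.0, unbiased
```

The same function serves for aggregating state means into a national
estimate when each state's value is weighted by its number of students.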

Use of weights, while producing unbiased or nearly unbiased estimates
of average values (estimates that tend to be the same as the population
value in the long run), also reduces the effective sample size. For the
example, the 50 schools produce a weighted estimate of the mean with a
standard error equal to that of an unweighted sample of 37 schools.* Thus,
care must be taken not to be too extreme in use of differential weighting in
stratified sampling. It should be apparent also that appropriate weighting
is impossible if the differential selection ratios are not known.
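
The figure of 37 schools can be reproduced with a short sketch. It
assumes that the effective sample size referred to is the familiar ratio of
the squared sum of the weights to the sum of the squared weights; that
reading is an assumption, although it does yield the value quoted in the
text. The function name is illustrative.

```python
def effective_sample_size(weights):
    """Squared sum of weights divided by the sum of squared weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# 25 schools from the 200-school stratum (weight 8 each) and
# 25 schools from the 800-school stratum (weight 32 each).
weights = [8] * 25 + [32] * 25
print(round(effective_sample_size(weights), 1))  # 36.8

# With equal weights the effective size equals the actual sample size.
print(effective_sample_size([1] * 50))           # 50.0
```

The more unequal the weights, the smaller this quantity becomes relative
to the number of units actually sampled.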

In summary, the most important problems for aggregation are (1) to
ensure that throughout the aggregation process the same measurement units
are aggregated and (2) to ensure that the knowledge of different stratum
selection ratios is available for use in weighting results appropriately.
The measurement unit that is subject to the fewest criticisms appears to
be the normalized standard score unit (one example of which is the normal
curve equivalent). In any case, in performing an aggregation using any unit
and weighting procedure, an analyst needs primarily to address the
questions of whether the aggregate score means the same thing for the
aggregate group as each individual member's score means for the individual
member and whether a particular score means the same thing for each member
who might obtain it.

* If each of n sampled units, u_i, has a weight, w_i, the effective sample
size is $(\sum_i w_i)^2 / \sum_i w_i^2$. In the case of equal weights
throughout, this is equal to n; otherwise it is less than n.

Summary

In this document, we have attempted to answer the question, "What has
been learned about evaluation methodology from the decade of compensatory
education?" During that decade, tens of millions of dollars have been spent
on educational evaluation, and partly because of the political significance
of the information produced by the studies, substantial efforts have been
undertaken to identify the methodological problems that can undermine the
validity of evaluation. From the resulting discussions and controversies,
which can be expected to continue, the most positive outcome has been the
recognition of the need for further development of evaluative expertise and
the expenditure of effort by capable researchers to satisfy that need. The
recommendations for evaluation methodology made previously in this document
and reiterated in this section are not merely those of the authors, but
rather the authors' interpretations of recommendations made by a large
number of researchers in this field. Although many of the recommendations
remain controversial in 1977, most, we believe, reflect the general
consensus among expert evaluators that greater efforts must be made to
gather less information more validly.

We deliberately avoided defining "evaluation" explicitly in this document,
because to do so in any useful way would preclude from consideration studies
that are only tangentially evaluative, in this case, of compensatory
education. Rather, we focused on the methodology of information gathering,
noting that the use of information to test rationales for decisions is a
common motivation for its being gathered and an important determinant of
decisions concerning methods to be used. The issues discussed pertain to
four phases of information gathering: design, sampling, measurement, and
analysis.

Design 

The two design issues discussed did not compare experimental,
quasi-experimental, and pre-experimental designs at great length, as was
adequately done by Campbell and Stanley (1963). They focused instead on two
more global problems: (1) whether quasi-experimental designs could feasibly
be implemented and what alternatives to quasi-experimental designs might be
appropriate for compensatory education evaluation; and (2) whether
conditions called for longitudinal data collection paradigms. The major
recommendations made concerning design are the following.

Recommendation 1. Future evaluations of the impact of compensatory
education should include comparisons of participating children's achievement
against a priori, or absolute, standards of expected achievement as well as,
or instead of, relative comparisons against the performance of statistically
equated comparison groups.

Recommendation 2. When evaluations must provide information based on
comparisons between groups, greater effort should be made to find ways of
selecting and assigning students to these groups randomly, so that the many
problems with statistical equating can be avoided. Several methods for
increasing the political feasibility of randomization were discussed.
Recommendations for proceeding when a relative comparison against a
nonequivalent comparison group is mandatory are discussed in the section on
analysis.

Recommendation 3. Individual student achievement gains should be measured
for intervals of whole years to avoid distortions that occur from testing
twice in the same classroom setting; fall-to-spring gains usually greatly
overestimate gains observed over whole-year periods.

Recommendation 4. Conclusions based on pretest-posttest gains should not
be compared to published norms without taking into account that the children
being assessed are taking the test (in parallel forms) twice, whereas the
norm group took the test only once, and other test administration artifacts.

Recommendation 5. Teachers' retrospective judgment of children's gains
should be disregarded for the purposes of program evaluation; however,
teachers' observations recorded during a treatment period can be valuable.

Recommendation 6. Long-term longitudinal studies, making use of
overlapping cohorts where possible, are necessary for ultimate impact
evaluation of Title I.

Recommendation 7. As a corollary, any evaluations of Title I undertaken
without funding for long-term longitudinal data collection should
nevertheless take inexpensive steps to ensure that the data base can later
be used as the first stage of a longitudinal study.

These recommendations are made because it is the authors' belief that
they would contribute to the improvement of the effectiveness with which
education evaluation funds are spent. That they are not completely novel is
evidenced by the fact that the design of the current Sustaining Effects
Study, being carried out by System Development Corporation for the U.S.
Office of Education, conforms to them more closely than did earlier studies.

Sampling

The two issues dealing with selection of projects or other units for
observation that were discussed are substantially less controversial than
the other issues in this document, possibly because of the ease of finding
compromise solutions (e.g., a medium-sized sample) as well as because the
theory of sampling is quite extensively developed. The issues discussed
relate to the aspects of representativeness and size of samples. The
following are the major recommendations that we believe should be made on
these topics.

Recommendation 8. The use of quantitatively representative samples
should be limited to instances where the information need is for
quantitative estimates of program operating characteristics; in other cases,
such as testing hypotheses about relationships, other sampling methods are
more efficient.

Recommendation 9. The needs for data analysis should be considered in
deciding upon the primary sampling units, and great caution should be used
in drawing inferences about units other than the primary sampling units.
Although valid inferences about student processes can be made when the
primary sampling unit is the classroom, it is also very easy to make invalid
inferences in that situation.

Recommendation 10. Although there are methods for explicitly deriving
needed sample sizes from information precision requirements, the value of
precision of information for testing decision rationales is as yet only
vaguely understood, so within broad limits the increased costs for large
samples may be better spent on more careful study, and therefore more valid
information, on smaller samples.

The main theme of these three recommendations is that sampling plans
cannot be developed independently from other aspects of information
gathering. Greater flexibility in sampling strategies than has been the
custom in compensatory education evaluations is called for.

Measurement

The discussion of measurement issues was limited to the measurement of
impact on children, primarily on their cognitive achievement. The validity
of measurement has undergone the most severe scrutiny of any of the
processes in evaluations of compensatory education, possibly because the
ways in which measurement can distort reality are more generally
understandable than the ways in which sampling or analysis can distort
reality, or possibly because of the fact that different ethnic groups obtain
different average scores on cognitive achievement tests. The three levels of
issue concerning measurement, which provided the structure for that section
of the document, are (1) selection of constructs to measure, (2) choice
between norm-referenced and criterion-referenced tests, and (3) selection of
measurement units in which to record test performance. The major measurement
recommendations made are the following.

Recommendation 11. Until more is known about the relations between
noncognitive and cognitive gains, measures of noncognitive gains should be
used only as supplements to measures of cognitive gains in the evaluation of
compensatory education impact.

Recommendation 12. Until more is known about the relations of component
skills (e.g., decoding, memory) to overall skills (e.g., reading ability),
measures of the component skills should be used only as supplements to
measures of overall skills in compensatory education evaluation.

Recommendation 13. Achievement data in compensatory education evaluation
should be interpreted in terms of models of cognitive growth processes. In
order for this to occur, further research on basic skills is necessary, and
the results of that research and existing research must be adapted for use
in evaluation studies.

Recommendation 14. Norm-referenced tests should not be used in program
evaluation unless the evaluator takes into account the problems in using
those tests (eight problems are discussed in this document); in any case,
using published norms as the "comparison group" in a relative comparison is
highly questionable.

Recommendation 15. Criterion-referenced tests should be seriously
considered for use in program evaluation; the most difficult problem to be
solved in their use in large scale evaluations is how to aggregate results
related to different local treatment objectives.

Recommendation 16. Test publishers should be encouraged in their efforts
to provide tests that are both explicitly criterion-referenced and also
norm-referenced; these attributes do not conflict.

Recommendation 17. Achievement test scores should always be corrected
for guessing when used in program evaluation, based on the number of items
each student attempted. This recommendation is made even though it virtually
eliminates the possibility of evaluation based on comparing scores on
published tests with norms tables.

Recommendation 18. Because of the great heterogeneity of skill levels
assessed in compensatory education evaluation, standardized tests sensitive
to substantially wider ranges of ability level should be developed; these
may require branching processes or differential wrong-response scoring in
order to be efficient.

Recommendation 19. Especially when analyses are to be done that assume a
normal distribution of scores, but also in other cases, scores should be
translated to normalized scores (e.g., normal curve equivalents) as
preparation for analysis.

Recommendation 20. Multivariate analysis of vectors of proficiency or
mastery scores on sets of component skills should be given serious
consideration for program evaluation.

Recommendation 21. Grade-equivalent scores should be avoided.

Analysis

The analytical issues in compensatory education evaluation have drawn the
greatest interest of theoretical methodologists. Dealing with these issues
provides a useful direction for methodological research, which is also
intellectually intriguing. Although three analytical issues were discussed
in this document, by far the major interest has been in the first: how to
compare the performance of a priori nonequivalent treatment and control
groups so that differences can be attributed to the treatment. The other two
issues discussed concern the inference of relations (e.g., between treatment
processes and effectiveness) from correlational data and the aggregation of
data across higher level sampling units. The major recommendations we make
on these three issues are the following.

Recommendation 22. Without resorting to unreliable measures, treatment
and comparison groups should be selected to be as similar as possible, even
when they cannot be randomly assigned.

Recommendation 23. A comprehensive consideration of potential differences
between treatment and control groups (prior to treatment) should be a part
of evaluation planning, and measurements of potential differences between
groups on variables related to performance should be undertaken.

Recommendation 24. Uncorrected, straightforward analysis of covariance
is a reasonable method for carrying out comparisons of nonequivalent groups,
but only if supplemented by subsidiary analyses that investigate, among
other things, (1) the reliability of covariates, (2) the residual
nonequivalence after partialing out the effects of covariates, (3) the
functional form of the regression function, and (4) the change in
conclusions that would result if any major untestable assumptions were
violated.

Recommendation 25. Whenever causal relational inferences are to be made
from quasi-experimental or correlational data, a system model that includes
a chain of events that underlies the relation is required, and measurement
of at least a subset of the intervening variables is necessary to rule out
alternative explanations of the correlation.

Recommendation 26. If scores are to be aggregated across different units
(e.g., districts, states, or regions), it is essential that the same
measurement unit be used in all cases; if the statistics are in
noncomparable units, summaries of summary statistics cannot be made
meaningful by statistical manipulation.

Recommendation 27. Information about sampling ratios in different strata
must be used in order to obtain unbiased total population estimates using
differential stratum weights.

As mentioned before, these recommendations range from obvious to
controversial, depending on the reader's viewpoint. Any attempt at
synthesis, such as this is, cannot explore the details of any particular
issue as thoroughly as would an investigator who focused his or her efforts
on a single issue; at some point in the not too distant future, many of the
issues will be substantially clarified because of the focused efforts of
qualified methodologists.

In addition to the limitation in thoroughness, this document is limited
in breadth in that not all of the methodological issues potentially relevant
to compensatory education evaluation could be discussed. Omissions we feel
most unhappy about include a discussion of the Bayesian approach to data
analysis, a presentation of quantitative methods for assigning values to
program outcomes, an exploration of alternative concepts of basic skills
development, a consideration of the external validity of laboratory
experiments, and a discussion of issues related to program cost estimation.
The issues discussed in this document are, however, the most critical
methodological issues for Title I evaluation, in our opinion.

In conclusion, the state of the art in educational evaluation has changed
dramatically from the situation ten years ago when the TEMPO study (Mosbaek,
1968) set out to test policy rationales by estimating linear regression
coefficients. Much of the effort in that decade has shown the need for
further effort to develop evaluation methodology to a level that researchers
and policymakers will both find pleasing. New compromises must be found
where conflicting values preclude simple solutions (e.g., randomized
designs). A primary purpose of this document has been to suggest a few paths
to follow in searching for those compromises.